TReconLM
TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
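For illustration, a toy instance of the task is shown below; the sequences are made up for this example and are not from our datasets:

```python
# Toy trace reconstruction instance (sequences are hypothetical).
ground_truth = "ACGTTAGCAC"

# Each trace is an independently corrupted copy of the ground truth.
traces = [
    "ACGTAGCAC",    # deletion of one T
    "ACGTTAGGCAC",  # insertion of an extra G
    "ACCTTAGCAC",   # substitution G -> C
]

# TReconLM takes the cluster of traces as input and predicts the ground truth.
```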
Model Variants
Pretrained Models (Fixed Length)
- model_seq_len_60.pt (60 nt)
- model_seq_len_110.pt (110 nt)
- model_seq_len_180.pt (180 nt)
Pretrained Models (Variable Length)
- model_var_len_50_120.pt (50-120 nt)
Fine-tuned Models
- finetuned_noisy_dna_len60.pt (60 nt, Noisy-DNA dataset)
- finetuned_microsoft_dna_len110.pt (110 nt, Microsoft DNA dataset)
- finetuned_chandak_len117.pt (117 nt, Chandak dataset)
All models support reconstruction from cluster sizes between 2 and 10.
How to Use
Tutorial notebooks are available in our GitHub repository under tutorial/:
- quick_start.ipynb: Run inference on synthetic datasets from Hugging Face
- custom_data.ipynb: Run inference on your own data or on real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
The test datasets used in the notebooks can be downloaded from Hugging Face.
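For orientation, a minimal inference sketch follows. Loading the checkpoint uses standard PyTorch; the `TReconLM` class and its `reconstruct` method are illustrative placeholders, not the repository's actual interface, which is documented in the tutorial notebooks:

```python
import torch

# Load a pretrained checkpoint (file names as listed above).
# NOTE: TReconLM and reconstruct() are hypothetical placeholders;
# see quick_start.ipynb for the actual API.
checkpoint = torch.load("model_seq_len_60.pt", map_location="cpu")

model = TReconLM()                  # hypothetical model class
model.load_state_dict(checkpoint)   # assumes the file stores a state dict
model.eval()

# One cluster of noisy reads (between 2 and 10 traces).
traces = ["ACGTAGCAC", "ACGTTAGGCAC", "ACCTTAGCAC"]
with torch.no_grad():
    prediction = model.reconstruct(traces)  # hypothetical helper
```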
Training Details
- Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10] (a minimal sketch of this sampling process is shown after this list).
- Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft, and Chandak datasets).
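The sketch below is our own illustrative reimplementation of the synthetic data generation described above, not the repository's actual pipeline; in particular, resampling the error probabilities once per cluster (rather than per trace) is an assumption:

```python
import random

ALPHABET = "ACGT"

def ids_channel(seq, p_ins, p_del, p_sub):
    """Corrupt a sequence with independent insertions, deletions,
    and substitutions at each position."""
    out = []
    for base in seq:
        if random.random() < p_ins:   # insertion before this position
            out.append(random.choice(ALPHABET))
        if random.random() < p_del:   # deletion: drop this base
            continue
        if random.random() < p_sub:   # substitution: replace this base
            out.append(random.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def sample_cluster(seq_len=60):
    """One synthetic training example: a ground truth and its noisy traces."""
    ground_truth = "".join(random.choices(ALPHABET, k=seq_len))
    # Assumption: error probabilities are resampled once per cluster.
    p_ins, p_del, p_sub = (random.uniform(0.01, 0.1) for _ in range(3))
    cluster_size = random.randint(2, 10)
    traces = [ids_channel(ground_truth, p_ins, p_del, p_sub)
              for _ in range(cluster_size)]
    return ground_truth, traces
```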
For full experimental details, see our paper.
Limitations
Models trained for a fixed sequence length may perform worse on other lengths, and all models may degrade when the test data distribution differs significantly from the training distribution. The variable-length model (model_var_len_50_120.pt) is trained with the same compute budget as the fixed-length models, so it sees less data per sequence length and may perform slightly worse than a fixed-length model at that model's specific length.