TReconLM
TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
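For illustration, a toy instance of the task is shown below; the sequences are made up for this example and are not from our datasets:

```python
# Toy trace reconstruction instance (sequences are hypothetical).
ground_truth = "ACGTTAGCAC"

# Each trace is an independently corrupted copy of the ground truth.
traces = [
    "ACGTAGCAC",    # deletion of one T
    "ACGTTAGGCAC",  # insertion of an extra G
    "ACCTTAGCAC",   # substitution G -> C
]

# TReconLM takes the cluster of traces as input and predicts the ground truth.
```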
Model Variants
Pretrained Models (Fixed Length)
- model_seq_len_60.pt (60 nt)
- model_seq_len_110.pt (110 nt)
- model_seq_len_180.pt (180 nt)
Pretrained Models (Variable Length)
- model_var_len_50_120.pt (50-120 nt)
Fine-tuned Models
- finetuned_noisy_dna_len60.pt (60 nt, Noisy-DNA dataset)
- finetuned_microsoft_dna_len110.pt (110 nt, Microsoft DNA dataset)
- finetuned_chandak_len117.pt (117 nt, Chandak dataset)
All models support reconstruction from cluster sizes between 2 and 10.
How to Use
Tutorial notebooks are available in our GitHub repository under tutorial/:
- quick_start.ipynb: Run inference on synthetic datasets from Hugging Face
- custom_data.ipynb: Run inference on your own data or on real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
The test datasets used in the notebooks can be downloaded from Hugging Face.
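For orientation, a minimal inference sketch follows. Loading the checkpoint uses standard PyTorch; the `TReconLM` class and its `reconstruct` method are illustrative placeholders, not the repository's actual interface, which is documented in the tutorial notebooks:

```python
import torch

# Load a pretrained checkpoint (file names as listed above).
# NOTE: TReconLM and reconstruct() are hypothetical placeholders;
# see quick_start.ipynb for the actual API.
checkpoint = torch.load("model_seq_len_60.pt", map_location="cpu")

model = TReconLM()                  # hypothetical model class
model.load_state_dict(checkpoint)   # assumes the file stores a state dict
model.eval()

# One cluster of noisy reads (between 2 and 10 traces).
traces = ["ACGTAGCAC", "ACGTTAGGCAC", "ACCTTAGCAC"]
with torch.no_grad():
    prediction = model.reconstruct(traces)  # hypothetical helper
```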
Training Details
- Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10] (a minimal sketch of this sampling process is shown after this list).
- Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft, and Chandak datasets).
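The sketch below is our own illustrative reimplementation of the synthetic data generation described above, not the repository's actual pipeline; in particular, resampling the error probabilities once per cluster (rather than per trace) is an assumption:

```python
import random

ALPHABET = "ACGT"

def ids_channel(seq, p_ins, p_del, p_sub):
    """Corrupt a sequence with independent insertions, deletions,
    and substitutions at each position."""
    out = []
    for base in seq:
        if random.random() < p_ins:   # insertion before this position
            out.append(random.choice(ALPHABET))
        if random.random() < p_del:   # deletion: drop this base
            continue
        if random.random() < p_sub:   # substitution: replace this base
            out.append(random.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def sample_cluster(seq_len=60):
    """One synthetic training example: a ground truth and its noisy traces."""
    ground_truth = "".join(random.choices(ALPHABET, k=seq_len))
    # Assumption: error probabilities are resampled once per cluster.
    p_ins, p_del, p_sub = (random.uniform(0.01, 0.1) for _ in range(3))
    cluster_size = random.randint(2, 10)
    traces = [ids_channel(ground_truth, p_ins, p_del, p_sub)
              for _ in range(cluster_size)]
    return ground_truth, traces
```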
For full experimental details, see our paper.
Limitations
Models trained for a fixed sequence length may perform worse on other lengths, and all models may degrade when the test data distribution differs significantly from the training distribution. The variable-length model (model_var_len_50_120.pt) is trained with the same compute budget as the fixed-length models, so it sees less data per sequence length and may perform slightly worse than a fixed-length model at that model's specific length.