TReconLM

TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.

Model Variants

Pretrained Models (Fixed Length)

  • model_seq_len_60.pt (60nt)
  • model_seq_len_110.pt (110nt)
  • model_seq_len_180.pt (180nt)

Pretrained Models (Variable Length)

  • model_var_len_50_120.pt (50-120nt)

Fine-tuned Models

All models support reconstruction from cluster sizes between 2 and 10.

How to Use

Tutorial notebooks are available in our GitHub repository under tutorial/:

  • quick_start.ipynb: Run inference on synthetic datasets from HuggingFace
  • custom_data.ipynb: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)

The test datasets used in the notebooks can be downloaded from Hugging Face.

Training Details

  • Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
  • Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
  • Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft, and Chandak datasets).
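The pretraining data generation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' generator: the function names are hypothetical, and two details are assumptions the text does not specify, namely that at most one error event occurs per position (with an insertion placed before the original base), and that the three error probabilities are drawn once per cluster rather than once per trace.

```python
import random

ALPHABET = "ACGT"  # quaternary DNA alphabet


def sample_ground_truth(length, rng):
    """Sample a sequence uniformly at random over the alphabet."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def corrupt(seq, p_ins, p_del, p_sub, rng):
    """Pass seq through an insertion/deletion/substitution channel.

    Assumption: the three events are mutually exclusive at each position.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ins:
            # Insertion: emit a random base, then keep the original base.
            out.append(rng.choice(ALPHABET))
            out.append(base)
        elif r < p_ins + p_del:
            # Deletion: drop the original base.
            continue
        elif r < p_ins + p_del + p_sub:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def sample_cluster(length, rng):
    """Return (ground_truth, traces) for one synthetic training example."""
    truth = sample_ground_truth(length, rng)
    # Assumption: error rates are drawn once per cluster from [0.01, 0.1].
    p_ins = rng.uniform(0.01, 0.1)
    p_del = rng.uniform(0.01, 0.1)
    p_sub = rng.uniform(0.01, 0.1)
    cluster_size = rng.randint(2, 10)  # inclusive on both ends
    traces = [corrupt(truth, p_ins, p_del, p_sub, rng) for _ in range(cluster_size)]
    return truth, traces


if __name__ == "__main__":
    rng = random.Random(0)
    truth, traces = sample_cluster(60, rng)
    print(truth)
    for trace in traces:
        print(trace)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible. For the exact channel definition and sampling scheme, refer to the paper.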

For full experimental details, see our paper.

Limitations

Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (model_var_len_50_120.pt) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse for a specific fixed length.
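One way to encode this guidance when choosing a pretrained checkpoint is sketched below. The helper `pick_checkpoint` is hypothetical and not part of the repository; it assumes that the 50-120nt range is inclusive and that a fixed-length model is preferable whenever the target length matches one exactly.

```python
# Checkpoint file names as listed under "Model Variants".
FIXED_LENGTH_CHECKPOINTS = {
    60: "model_seq_len_60.pt",
    110: "model_seq_len_110.pt",
    180: "model_seq_len_180.pt",
}
VARIABLE_LENGTH_CHECKPOINT = "model_var_len_50_120.pt"
VARIABLE_LENGTH_RANGE = (50, 120)  # assumed inclusive


def pick_checkpoint(seq_len):
    """Return the checkpoint file name suggested for a ground-truth length."""
    # Prefer a fixed-length model when the length matches exactly.
    if seq_len in FIXED_LENGTH_CHECKPOINTS:
        return FIXED_LENGTH_CHECKPOINTS[seq_len]
    # Otherwise fall back to the variable-length model if it covers the length.
    low, high = VARIABLE_LENGTH_RANGE
    if low <= seq_len <= high:
        return VARIABLE_LENGTH_CHECKPOINT
    raise ValueError(f"No pretrained checkpoint covers length {seq_len}")


if __name__ == "__main__":
    for length in (60, 75, 110, 180):
        print(length, pick_checkpoint(length))
```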
