Training pipeline and dataset details for replication?

#6 by Kukedlc

Hey, great work on Maya1! I read through the model card - really solid technical choices with the SNAC codec and Llama backbone.
I'm looking to replicate this for Spanish, and while the preprocessing pipeline is well documented (MFA alignment, MinHash-LSH dedup, etc.; my own dedup prototype is at the end of this post), some key details are missing:

**Training scripts/code:** Is there a training repo available? The model card only includes the vLLM inference script.
**Dataset scale:** How many hours of audio did you use for:

- Pretraining (internet-scale corpus)
- Supervised fine-tuning (curated studio recordings)

**Annotation process:** For the SFT dataset with 20+ emotion tags, did you use manual annotation, automatic labeling, or a mix? Any tooling you can share?
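For what it's worth, the semi-automatic pass I've been sketching for Spanish is below: pre-label clips with an off-the-shelf speech-emotion classifier and route low-confidence ones to human review. The checkpoint and the 0.75 threshold are my own choices, not anything from your pipeline:

```python
from transformers import pipeline

# Off-the-shelf speech emotion recognition checkpoint (my pick, not Maya1's);
# any audio-classification model with emotion labels would slot in here.
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

def pre_label(wav_path: str, review_threshold: float = 0.75):
    """Return (emotion_label, needs_human_review) for one clip."""
    top = classifier(wav_path, top_k=1)[0]
    # Low-confidence predictions get routed to manual annotation instead.
    return top["label"], top["score"] < review_threshold
```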
**Training hyperparameters:** Learning rate schedules, batch sizes, and number of epochs/steps for both the pretraining and SFT phases?
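For reference, these are the placeholder values I'm planning to start from for the SFT phase. Every number is a guess on my end (sized for a ~3B Llama-style backbone), which is exactly why I'm asking:

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    # All values are my starting guesses, not numbers from the Maya1 run.
    learning_rate: float = 2e-5
    lr_schedule: str = "cosine"     # with linear warmup
    warmup_steps: int = 500
    global_batch_size: int = 64     # sequences per optimizer step
    max_steps: int = 10_000
    weight_decay: float = 0.01
    bf16: bool = True
```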
**Voice description generation:** How did you create the natural-language descriptions for the curated dataset? Manual writing or some automated process?
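On my side I've prototyped a simple template that renders per-speaker metadata into description strings, so I'm curious whether you did something similar, used an LLM, or wrote them by hand. All field names and phrasing below are invented for illustration:

```python
def describe_voice(meta: dict) -> str:
    """Render hypothetical speaker metadata into a voice description string."""
    parts = [
        f"{meta['gender']} voice in their {meta['age_range']}",
        f"{meta['tone']} tone",
        f"{meta['pace']} pace",
    ]
    if meta.get("accent"):
        parts.append(f"{meta['accent']} accent")
    return "Realistic " + ", ".join(parts) + "."

# describe_voice({"gender": "male", "age_range": "30s", "tone": "warm",
#                 "pace": "measured", "accent": "Mexican Spanish"})
# -> "Realistic male voice in their 30s, warm tone, measured pace, Mexican Spanish accent."
```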

I understand some parts might be proprietary, but any guidance on scale and process would be incredibly helpful for bringing this to underserved languages.
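In case the context is useful, here's the transcript-level MinHash-LSH dedup pass I've already prototyped for my Spanish corpus, built on the datasketch library. The word 5-gram shingles, 128 permutations, and 0.8 Jaccard threshold are my guesses at reasonable defaults, not values from your pipeline:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # my default; trades match accuracy against index size

def minhash_of(text: str, shingle: int = 5) -> MinHash:
    """MinHash over word 5-gram shingles of one transcript."""
    words = text.lower().split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - shingle + 1, 1)):
        m.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return m

def dedup(transcripts: dict[str, str], threshold: float = 0.8) -> set[str]:
    """Return the ids of transcripts to keep, dropping near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    keep: set[str] = set()
    for clip_id, text in transcripts.items():
        m = minhash_of(text)
        if not lsh.query(m):   # no near-duplicate indexed yet
            lsh.insert(clip_id, m)
            keep.add(clip_id)
    return keep
```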
Thanks!
