Training pipeline and dataset details for replication?

#6 by Kukedlc

Hey, great work on Maya1! I read through the model card - really solid technical choices with the SNAC codec and Llama backbone.
I'm looking to replicate this for Spanish, and while the preprocessing pipeline is well documented (MFA alignment, MinHash-LSH dedup, etc.; my own dedup prototype is at the end of this post), some key details are missing:

**Training scripts/code:** Is there a training repo available? The model card only includes the vLLM inference script.
**Dataset scale:** How many hours of audio did you use for:

- Pretraining (internet-scale corpus)
- Supervised fine-tuning (curated studio recordings)

**Annotation process:** For the SFT dataset with 20+ emotion tags, did you use manual annotation, automatic labeling, or a mix? Any tooling you can share?
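For what it's worth, the semi-automatic pass I've been sketching for Spanish is below: pre-label clips with an off-the-shelf speech-emotion classifier and route low-confidence ones to human review. The checkpoint and the 0.75 threshold are my own choices, not anything from your pipeline:

```python
from transformers import pipeline

# Off-the-shelf speech emotion recognition checkpoint (my pick, not Maya1's);
# any audio-classification model with emotion labels would slot in here.
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

def pre_label(wav_path: str, review_threshold: float = 0.75):
    """Return (emotion_label, needs_human_review) for one clip."""
    top = classifier(wav_path, top_k=1)[0]
    # Low-confidence predictions get routed to manual annotation instead.
    return top["label"], top["score"] < review_threshold
```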
**Training hyperparameters:** Learning rate schedules, batch sizes, and number of epochs/steps for both the pretraining and SFT phases?
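For reference, these are the placeholder values I'm planning to start from for the SFT phase. Every number is a guess on my end (sized for a ~3B Llama-style backbone), which is exactly why I'm asking:

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    # All values are my starting guesses, not numbers from the Maya1 run.
    learning_rate: float = 2e-5
    lr_schedule: str = "cosine"     # with linear warmup
    warmup_steps: int = 500
    global_batch_size: int = 64     # sequences per optimizer step
    max_steps: int = 10_000
    weight_decay: float = 0.01
    bf16: bool = True
```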
**Voice description generation:** How did you create the natural-language descriptions for the curated dataset? Manual writing or some automated process?
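On my side I've prototyped a simple template that renders per-speaker metadata into description strings, so I'm curious whether you did something similar, used an LLM, or wrote them by hand. All field names and phrasing below are invented for illustration:

```python
def describe_voice(meta: dict) -> str:
    """Render hypothetical speaker metadata into a voice description string."""
    parts = [
        f"{meta['gender']} voice in their {meta['age_range']}",
        f"{meta['tone']} tone",
        f"{meta['pace']} pace",
    ]
    if meta.get("accent"):
        parts.append(f"{meta['accent']} accent")
    return "Realistic " + ", ".join(parts) + "."

# describe_voice({"gender": "male", "age_range": "30s", "tone": "warm",
#                 "pace": "measured", "accent": "Mexican Spanish"})
# -> "Realistic male voice in their 30s, warm tone, measured pace, Mexican Spanish accent."
```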

I understand some parts might be proprietary, but any guidance on scale and process would be incredibly helpful for bringing this to underserved languages.
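In case the context is useful, here's the transcript-level MinHash-LSH dedup pass I've already prototyped for my Spanish corpus, built on the datasketch library. The word 5-gram shingles, 128 permutations, and 0.8 Jaccard threshold are my guesses at reasonable defaults, not values from your pipeline:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # my default; trades match accuracy against index size

def minhash_of(text: str, shingle: int = 5) -> MinHash:
    """MinHash over word 5-gram shingles of one transcript."""
    words = text.lower().split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - shingle + 1, 1)):
        m.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return m

def dedup(transcripts: dict[str, str], threshold: float = 0.8) -> set[str]:
    """Return the ids of transcripts to keep, dropping near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    keep: set[str] = set()
    for clip_id, text in transcripts.items():
        m = minhash_of(text)
        if not lsh.query(m):   # no near-duplicate indexed yet
            lsh.insert(clip_id, m)
            keep.add(clip_id)
    return keep
```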
Thanks!
