KaniTTS

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications.

Overview

KaniTTS uses a two-stage pipeline that combines a large language model with an efficient audio codec for exceptional speed and audio quality. The backbone LLM first generates compressed token representations of the input text, and a neural audio codec then rapidly synthesizes the waveform from those tokens, keeping end-to-end latency extremely low.
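
As a rough mental model, the two stages look like the sketch below. The function names (generate_audio_tokens, decode_tokens) and the frame rate are illustrative placeholders, not part of the kani-tts API:

import numpy as np

# Hypothetical sketch of the two-stage pipeline; the names below are
# placeholders, not the library's real API.

def generate_audio_tokens(text: str) -> list[int]:
    # Stage 1: the LFM2 backbone autoregressively emits compressed
    # audio-codec token IDs conditioned on the input text.
    return [0] * 300  # placeholder token stream

def decode_tokens(tokens: list[int]) -> np.ndarray:
    # Stage 2: the neural audio codec (NanoCodec) reconstructs a 22 kHz
    # waveform from the token stream in a single fast pass.
    samples_per_token = 22050 // 50  # assumed frame rate, for illustration only
    return np.zeros(len(tokens) * samples_per_token, dtype=np.float32)

waveform = decode_tokens(generate_audio_tokens("Hello, world!"))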

Key Specifications:

  • Model Size: 370M parameters
  • Sample Rate: 22kHz
  • Languages: English, German, Chinese, Korean, Arabic, Spanish
  • License: Apache 2.0

Performance

Nvidia RTX 5080 Benchmarks:

  • Latency: ~1 second to generate 15 seconds of audio
  • Memory: 2GB GPU VRAM
  • Quality Metrics: MOS 4.3/5 (naturalness), WER <5% (accuracy)
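
In other words, the model runs at a real-time factor of roughly 0.07: about 1 second of compute per 15 seconds of generated audio.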

Pretraining:

  • Dataset: ~80k hours from LibriTTS, Common Voice, and Emilia
  • Hardware: 8x H100 GPUs, 45 hours training time on Lambda AI

Voices

  • david — David, English (British)
  • puck — Puck, English (Gemini)
  • kore — Kore, English (Gemini)
  • andrew — Andrew, English
  • jenny — Jenny, English (Irish)
  • simon — Simon, English
  • katie — Katie, English
  • seulgi — Seulgi, Korean
  • bert — Bert, German
  • thorsten — Thorsten, German (Hessisch)
  • maria — Maria, Spanish
  • mei — Mei, Chinese (Cantonese)
  • ming — Ming, Chinese (Shanghai OpenAI)
  • karim — Karim, Arabic
  • nur — Nur, Arabic

Quickstart: Install from PyPI & Run Inference

It's a lightweight package, so you can install, load a model, and start speaking in minutes. Designed for quick starts and simple workflows: no heavy setup, just pip install and run. More details...

Install

pip install kani-tts
pip install -U "transformers==4.57.1"  # required for the LFM2 backbone

Quick Start

from kani_tts import KaniTTS

model = KaniTTS('nineninesix/kani-tts-370m')

# Generate audio from text
audio, text = model("Hello, world!")

# Save to file (requires soundfile)
model.save_audio(audio, "output.wav")
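
If you prefer to write files yourself, the returned waveform can be saved with soundfile directly. This assumes, as in the Jupyter example further below, that the output rate is exposed as model.sample_rate:

import soundfile as sf

# Equivalent to model.save_audio: write the waveform at the model's sample rate.
sf.write("output.wav", audio, model.sample_rate)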

Working with Multi-Speaker Models

This model supports multiple speakers. You can check whether your model supports speakers and select a specific voice:

from kani_tts import KaniTTS

model = KaniTTS('nineninesix/kani-tts-370m')

# Check if model supports multiple speakers
print(f"Model type: {model.status}")  # 'singlspeaker' or 'multispeaker'

# Display available speakers (pretty formatted)
model.show_speakers()

# Or access the speaker list directly
print(model.speaker_list)  # ['andrew', 'katie', ...]

# Generate audio with a specific speaker
audio, text = model("Hello, world!", speaker_id="andrew")
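
For example, to render the same line once per available voice, using only the calls shown above:

# Generate one audio file per voice in the speaker list.
for speaker in model.speaker_list:
    audio, _ = model("Hello, world!", speaker_id=speaker)
    model.save_audio(audio, f"hello_{speaker}.wav")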

Custom Configuration

from kani_tts import KaniTTS

model = KaniTTS(
    'nineninesix/kani-tts-370m',
    temperature=0.7,           # Control randomness (default: 1.0)
    top_p=0.9,                 # Nucleus sampling (default: 0.95)
    max_new_tokens=2000,       # Max audio length (default: 1200)
    repetition_penalty=1.2,    # Prevent repetition (default: 1.1)
    suppress_logs=True,        # Suppress library logs (default: True)
    show_info=True,            # Show model info on init (default: True)
)

audio, text = model("Your text here")
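
Lower temperature and top_p values give flatter, more deterministic delivery, while higher values add prosodic variety at some risk of artifacts. A quick way to compare settings by ear, given that the sampling parameters are passed at construction as documented above:

# Render the same sentence under two sampling setups and compare the results.
for temp in (0.6, 1.0):
    model = KaniTTS('nineninesix/kani-tts-370m', temperature=temp, top_p=0.9)
    audio, _ = model("The quick brown fox jumps over the lazy dog.")
    model.save_audio(audio, f"sample_temp_{temp}.wav")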

Playing Audio in Jupyter Notebooks

You can listen to generated audio directly in Jupyter notebooks or IPython:

from kani_tts import KaniTTS
from IPython.display import Audio as aplay

model = KaniTTS('nineninesix/kani-tts-370m')
audio, text = model("Hello, world!")

# Play audio in notebook
aplay(audio, rate=model.sample_rate)
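
Note that IPython's Audio widget peak-normalizes the signal by default, so loudness in the notebook may differ from a saved file; pass normalize=False to hear the raw levels.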

Audio Examples

Sample prompts (the corresponding audio clips are playable on the model page):

  • I do believe Marsellus Wallace, MY husband, YOUR boss, told you to take me out and do WHATEVER I WANTED.
  • What do we say to the god of death? Not today!
  • What do you call a lawyer with an IQ of 60? Your honor
  • You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you?

Use Cases

  • Conversational AI: Real-time speech for chatbots and virtual assistants
  • Edge/Server Deployment: Resource-efficient inference on affordable hardware
  • Accessibility: Screen readers and language learning applications
  • Research: Fine-tuning for specific voices, accents, or emotions

Limitations

  • Performance degrades with inputs exceeding 2000 tokens
  • Limited expressivity without fine-tuning for specific emotions
  • May inherit biases from training data in prosody or pronunciation
  • Optimized primarily for English; other languages may require additional training

Optimization Tips

  • Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
  • Batch Processing: Use batches of 8-16 for high-throughput scenarios
  • Hardware: Optimized for NVIDIA Blackwell architecture GPUs
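
A batched entry point isn't documented in this card, so the sketch below simply chunks a workload into groups of 8 and processes them sequentially; swap the inner loop for a true batch call if your version of the library exposes one:

texts = [f"Utterance number {i}." for i in range(32)]

# Process the workload in chunks of 8 (see the batching tip above); this is a
# sequential stand-in for real batched inference, which this card does not document.
chunk = 8
for start in range(0, len(texts), chunk):
    for offset, text in enumerate(texts[start:start + chunk]):
        audio, _ = model(text)
        model.save_audio(audio, f"utterance_{start + offset}.wav")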


Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.

Responsible Use

Prohibited activities include:

  • Illegal content or harmful, threatening, defamatory, or obscene material
  • Hate speech, harassment, or incitement of violence
  • Generating false or misleading information
  • Impersonating individuals without consent
  • Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.

Citation

@misc{sb_2025,
    author    = {SB},
    title     = {gemini-flash-2.0-speech},
    year      = {2025},
    url       = {https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech},
    doi       = {10.57967/hf/4237},
    publisher = {Hugging Face}
}

@misc{toyin2025arvoicemultispeakerdatasetarabic,
    title         = {ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis},
    author        = {Hawau Olamide Toyin and Rufael Marew and Humaid Alblooshi and Samar M. Magdy and Hanan Aldarmaki},
    year          = {2025},
    eprint        = {2505.20506},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL},
    url           = {https://arxiv.org/abs/2505.20506}
}

@misc{thorsten_müller_2024,
    author    = {Thorsten Müller},
    title     = {TV-44kHz-Full (Revision ff427ec)},
    year      = {2024},
    url       = {https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full},
    doi       = {10.57967/hf/3290},
    publisher = {Hugging Face}
}

@misc{carlosmenaciempiessfem2019,
    title          = {CIEMPIESS FEM CORPUS: Audio and Transcripts of Female Speakers in Spanish},
    author         = {Hernandez Mena, Carlos Daniel},
    year           = {2019},
    journal        = {Linguistic Data Consortium, Philadelphia},
    ldc_catalog_no = {LDC2019S07},
    doi            = {10.35111/xdx5-n815},
    url            = {https://catalog.ldc.upenn.edu/LDC2019S07}
}