BatonVoice: An Operationalist Framework for Controllable Speech Synthesis

This is the official implementation of the paper: BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs.

🎵 Abstract

We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation.

We introduce BatonVoice, a framework where a Large Language Model (LLM) acts as a "conductor". The conductor's role is to understand nuanced user instructions and generate a detailed, textual "plan". This plan consists of explicit, word-level vocal features (e.g., pitch, energy, speaking rate).

A separate, specialized TTS model, the "orchestra", then executes this plan, generating the final speech directly from these precise features. To realize this component, we developed BatonTTS, a 1.7B-parameter TTS model trained specifically for this task. It uses Qwen3-1.7B as its backbone and the speech tokenizer of CosyVoice2.

🏛️ Framework Overview

The BatonVoice framework operates in a simple yet powerful sequence:

[User Instruction] ➡️ [LLM (Conductor)] ➡️ [Textual Plan (Features)] ➡️ [BatonTTS (Orchestra)] ➡️ [Speech Output]

This decoupling enables fine-grained control and expressiveness: the complex task of interpretation is handled by a powerful LLM, while the TTS model focuses solely on high-fidelity audio generation based on explicit guidance.
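The data flow above can be outlined in a few lines of Python. This is an illustrative sketch only: `run_pipeline`, `fake_conductor`, and `fake_orchestra` are hypothetical names, not functions from this repository, and the two stubs stand in for the real LLM and BatonTTS calls.

```python
from typing import Callable, Dict, List, Union

# One plan entry: a text segment plus its numeric vocal features.
Segment = Dict[str, Union[str, float]]
FeaturePlan = List[Segment]


def run_pipeline(
    text: str,
    instruction: str,
    conductor: Callable[[str, str], FeaturePlan],
    orchestra: Callable[[str, FeaturePlan], bytes],
) -> bytes:
    """Hypothetical glue code: the conductor plans, the orchestra renders."""
    plan = conductor(text, instruction)  # LLM: instruction -> textual plan
    return orchestra(text, plan)         # TTS: plan -> audio


# Stand-in stubs so the sketch runs without any model.
def fake_conductor(text: str, instruction: str) -> FeaturePlan:
    return [{"word": text, "pitch_mean": 300, "energy_rms": 0.01}]


def fake_orchestra(text: str, plan: FeaturePlan) -> bytes:
    return f"<audio for {len(plan)} segment(s)>".encode()


audio = run_pipeline("Hello there.", "Speak warmly.", fake_conductor, fake_orchestra)
```

Because the two stages only share the plan, either side can be swapped (a different LLM, or a hand-written plan) without touching the other.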

🚀 Getting Started

Core Principle: Word-Level Feature Control

The core of our framework is the ability to control the synthesized speech through word-level acoustic features. This means you can fine-tune the output by adjusting the specific numerical values for each word or segment.
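As a minimal sketch of what such an adjustment could look like, the snippet below shifts one numeric value in one segment of a plan. The helper `shift_feature` is hypothetical (it is not part of this repository); the key names follow the plans shown in the Examples section of this README.

```python
import copy


def shift_feature(plan, index, key, delta):
    """Return a copy of `plan` with `delta` added to one feature of one segment."""
    new_plan = copy.deepcopy(plan)  # leave the original plan untouched
    new_plan[index][key] = new_plan[index][key] + delta
    return new_plan


plan = [
    {"word": "Wow, you really", "pitch_mean": 360, "energy_rms": 0.016},
    {"word": "did a great job.", "pitch_mean": 330, "energy_rms": 0.014},
]

# Raise the pitch of the second segment only.
raised = shift_feature(plan, 1, "pitch_mean", 20)
```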

Recommended Workflow

For the best results, we highly recommend using a powerful, instruction-following LLM to generate the initial feature plan. This significantly reduces the manual effort required.

1. Generate a Feature Template with an LLM: Use a powerful LLM like Gemini 2.5 Pro to generate a feature plan based on your text and a descriptive prompt (e.g., "in a happy and excited tone").
   - For detailed examples of how to structure these prompts, please refer to our client implementations: openrouter_gemini_client.py and gradio_tts_interface.py.
2. (Optional) Manually Fine-Tune the Features: Review the LLM-generated features. You can manually adjust the values for specific words or phrases to get the delivery you want. This word-level editing is where the true power of BatonVoice lies.
3. Synthesize Speech with BatonTTS: Feed the final feature plan into the BatonTTS model to generate the audio.
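The first step of this workflow can be sketched as follows, assuming the LLM is asked to reply with a JSON list. The prompt wording and the helpers `build_prompt` and `parse_plan` are hypothetical and are not taken from openrouter_gemini_client.py; no network call is made, and a canned reply stands in for the LLM.

```python
import json
import re

FEATURE_KEYS = ["pitch_mean", "pitch_slope", "energy_rms", "energy_slope", "spectral_centroid"]


def build_prompt(text: str, instruction: str) -> str:
    """Hypothetical prompt; the repository's client files define the real one."""
    keys = ", ".join(FEATURE_KEYS)
    return (
        f"Instruction: {instruction}\n"
        f"Text: {text}\n"
        'Return only a JSON list. Each item needs a "word" field '
        f"and these numeric fields: {keys}."
    )


def parse_plan(llm_reply: str) -> list:
    """Extract the outermost [...] span from a reply that may contain extra prose."""
    match = re.search(r"\[.*\]", llm_reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON list found in the LLM reply")
    return json.loads(match.group(0))


# Canned reply standing in for a real LLM response.
reply = (
    "Here is the plan:\n"
    '[{"word": "Hello there.", "pitch_mean": 300, "pitch_slope": 10, '
    '"energy_rms": 0.01, "energy_slope": 0, "spectral_centroid": 2000}]'
)
plan = parse_plan(reply)
```

Parsing defensively is a design choice here: LLM replies often wrap the JSON in extra text, so the sketch pulls out the bracketed span before decoding it.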

Alternative Method (Less Recommended)

You can also use BatonTTS in a text-only mode to generate both the features and the speech. However, due to the limitations of a smaller model, the generated features often lack variation, resulting in a monotonous voice. We strongly suggest using the LLM-driven workflow for expressive results.

⚙️ Understanding the Features

You can control the speech output by adjusting the following features in the plan.

Feature	Description
`pitch_mean`	The mean fundamental frequency (F0) of the voice for the segment. Higher values mean a higher-pitched voice.
`pitch_slope`	The rate of change of pitch within the segment. Positive values indicate a rising intonation.
`energy_rms`	The root mean square energy, corresponding to the loudness or volume of the segment.
`energy_slope`	The rate of change of energy. Can be used to create crescendo or decrescendo effects.
`spectral_centroid`	Relates to the "brightness" of the sound. Higher values often sound clearer or sharper.
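Before sending a plan to the TTS model, it can help to check that every segment carries these fields. The following is a minimal sketch, not part of this repository: `validate_plan` is a hypothetical helper, and the key names are the ones used in the example plans of this README.

```python
from numbers import Real

REQUIRED_NUMERIC = ("pitch_mean", "pitch_slope", "energy_rms", "energy_slope", "spectral_centroid")


def validate_plan(plan):
    """Return a list of problems; an empty list means the plan looks well-formed."""
    problems = []
    for i, seg in enumerate(plan):
        word = seg.get("word")
        if not isinstance(word, str) or not word.strip():
            problems.append(f"segment {i}: missing or empty 'word'")
        for key in REQUIRED_NUMERIC:
            value = seg.get(key)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"segment {i}: '{key}' must be a number")
    return problems


good = [{"word": "Hello there.", "pitch_mean": 300, "pitch_slope": 10,
         "energy_rms": 0.01, "energy_slope": 0, "spectral_centroid": 2000}]
bad = [{"word": "", "pitch_mean": "high"}]

good_problems = validate_plan(good)
bad_problems = validate_plan(bad)
```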

A Special Feature: Word Segmentation

The `word` field and the structure of the feature list itself provide powerful control over the rhythm and pacing of the speech.

Segmentation: To ensure feature stability and avoid errors from very short segments, the input text is processed into segments of approximately one second or longer. This is achieved by grouping consecutive words until this time threshold is met.

This has two important implications:

- Speaking Rate: The number of words in a segment's `word` field implicitly indicates the local speaking rate. Because each segment spans roughly the same duration, more words in a single segment mean a faster rate of speech for that phrase.
- Pauses: The boundaries between dictionaries in the list suggest potential pause locations in the synthesized speech. You can encourage a pause at a given point by splitting the sentence into more segments there.
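The grouping rule described above can be sketched as follows, assuming word-level timestamps are already available (for example from a forced aligner). The helper `group_words` and the timestamps are hypothetical, and the handling of a short leftover tail is an assumption of this sketch rather than the documented behavior.

```python
def group_words(timed_words, min_duration=1.0):
    """Group consecutive (word, start, end) tuples into segments that
    each span at least `min_duration` seconds."""
    groups, current = [], []
    for item in timed_words:
        current.append(item)
        # Span runs from the first word's start to the latest word's end.
        if current[-1][2] - current[0][1] >= min_duration:
            groups.append(current)
            current = []
    if current:
        # Assumption: a short leftover tail joins the previous segment.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)

    segments = []
    for group in groups:
        duration = group[-1][2] - group[0][1]
        segments.append({
            "word": " ".join(word for word, _, _ in group),
            "num_words": len(group),
            # Local speaking rate implied by the segment, in words per second.
            "words_per_second": len(group) / duration if duration > 0 else 0.0,
        })
    return segments


# Made-up timestamps in seconds, for illustration only.
timed = [("Wow,", 0.0, 0.5), ("you", 0.6, 0.8), ("really", 0.8, 1.2),
         ("did", 1.3, 1.5), ("a", 1.5, 1.6), ("great", 1.6, 2.0), ("job.", 2.0, 2.5)]
segments = group_words(timed)
```

With these timestamps the sentence falls into two segments, and the second one packs more words into a similar span, which corresponds to a faster local speaking rate.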

✨ Examples

Let's see how to generate features for the sentence: "Wow, you really did a great job." using Gemini 2.5 Pro with different emotional instructions.

Example 1: Happy Tone

# Prompt: "Please speak in a happy tone."
text = "Wow, you really did a great job."

feature_plan_happy = [
    {"word": "Wow, you really", "pitch_mean": 360, "pitch_slope": 95, "energy_rms": 0.016, "energy_slope": 60, "spectral_centroid": 2650},
    {"word": "did a great job.", "pitch_mean": 330, "pitch_slope": -80, "energy_rms": 0.014, "energy_slope": -50, "spectral_centroid": 2400},
]

🎵 Audio Output: Listen to happy.wav

Example 2: Sarcastic Tone

# Prompt: "Please speak in a sarcastic tone."
text = "Wow, you really did a great job."

feature_plan_sarcastic = [
    {"word": "wow", "pitch_mean": 271, "pitch_slope": 6, "energy_rms": 0.009, "energy_slope": -4, "spectral_centroid": 2144},
    {"word": "you really", "pitch_mean": 270, "pitch_slope": 195, "energy_rms": 0.01, "energy_slope": 8, "spectral_centroid": 1403},
    {"word": "did a great", "pitch_mean": 287, "pitch_slope": 152, "energy_rms": 0.009, "energy_slope": -15, "spectral_centroid": 1920},
    {"word": "job", "pitch_mean": 166, "pitch_slope": -20, "energy_rms": 0.004, "energy_slope": -66, "spectral_centroid": 1881},
]

🎵 Audio Output: Listen to sarcastic.wav

Acknowledgments

Qwen3: Powerful LLM Backbone
CosyVoice2: Advanced TTS model from FunAudioLLM
Matcha-TTS: High-quality TTS architecture
Whisper: Speech recognition capabilities
Wav2Vec2: Word-level alignment features

Note: For research purposes only. Do not use for commercial or production purposes.
