# Parakeet-TDT 1.1B ONNX

Pre-exported ONNX models of NVIDIA's Parakeet-TDT 1.1B, ready to use without running the Docker/NeMo export yourself.

## Model Description
- Base Model: nvidia/parakeet-tdt-1.1b
- Architecture: Token-and-Duration Transducer (TDT) with FastConformer encoder
- Parameters: 1.1 billion
- Features: 80 mel filterbank features (16kHz)
- Vocabulary: 1024 SentencePiece tokens + blank token
- Format: ONNX Runtime compatible (CPU/GPU)
## What's Included

### Quantized Models (Recommended - 3.8GB total)

- `encoder.int8.onnx` (1.1GB) - INT8 quantized encoder with external weights
- `encoder.int8.weights` (2.0GB) - INT8 encoder weight file
- `decoder.int8.onnx` (7.0MB) - INT8 quantized decoder
- `joiner.int8.onnx` (1.7MB) - INT8 quantized joiner
- `tokens.txt` (11KB) - Vocabulary file

### Full Precision Models (Optional - 8.1GB additional)

- `encoder.onnx` (41MB) + `encoder.weights` (4.0GB) - FP32 encoder
- `decoder.onnx` (28MB) - FP32 decoder
- `joiner.onnx` (6.6MB) - FP32 joiner
- 639 layer weight files (various sizes)
## Quick Start

### Download Models

```bash
# Download all models (12GB)
huggingface-cli download jenerallee78/parakeet-tdt-1.1b-onnx --local-dir ./models

# Or download just the INT8 quantized models (3.8GB - recommended)
huggingface-cli download jenerallee78/parakeet-tdt-1.1b-onnx \
  --include "encoder.int8.*" "decoder.int8.onnx" "joiner.int8.onnx" "tokens.txt" \
  --local-dir ./models
```
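The same selective download can be scripted with `huggingface_hub`; the repo ID and file patterns below are the ones from this card, passed through `snapshot_download`'s standard `allow_patterns` filter:

```python
from huggingface_hub import snapshot_download

# Fetch only the INT8 models and the vocabulary file (~3.8GB)
snapshot_download(
    repo_id="jenerallee78/parakeet-tdt-1.1b-onnx",
    allow_patterns=[
        "encoder.int8.*",
        "decoder.int8.onnx",
        "joiner.int8.onnx",
        "tokens.txt",
    ],
    local_dir="./models",
)
```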
### Use with ONNX Runtime (Python)

```python
import onnxruntime as ort
import numpy as np

# Load the three transducer components
encoder_session = ort.InferenceSession("models/encoder.int8.onnx")
decoder_session = ort.InferenceSession("models/decoder.int8.onnx")
joiner_session = ort.InferenceSession("models/joiner.int8.onnx")

# Load vocabulary (one "token id" pair per line)
with open("models/tokens.txt") as f:
    vocab = [line.split()[0] for line in f]

# Inference (simplified example)
# ... (add mel feature extraction)
encoder_out = encoder_session.run(None, {"audio_signal": mel_features})[0]
# ... (add decoding loop with decoder and joiner)
```
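The elided decoding loop is a greedy transducer search: for each encoder frame, feed the last emitted token to the decoder, combine both outputs in the joiner, and take the argmax; TDT additionally predicts a duration that lets decoding skip ahead several frames. The sketch below continues the block above. It is hedged, not definitive: the tensor names (`targets`, `input_states_1`, `encoder_outputs`, ...), the output ordering, the blank-last convention, and the `[0, 1, 2, 3, 4]` duration bins are assumptions in the style of NeMo's usual transducer exports, so verify them with the introspection step first.

```python
import numpy as np

# The exact tensor names are baked into the export and vary between export
# scripts, so discover them instead of hard-coding:
for name, sess in [("encoder", encoder_session),
                   ("decoder", decoder_session),
                   ("joiner", joiner_session)]:
    print(name, "inputs:", [i.name for i in sess.get_inputs()])

# --- Greedy decoding sketch (ASSUMED names, ordering, and duration bins) ----
blank_id = len(vocab) - 1                        # assumes blank is the last token
durations = [0, 1, 2, 3, 4]                      # assumed TDT duration bins
state1 = np.zeros((2, 1, 640), dtype=np.float32) # pred_rnn_layers=2, pred_hidden=640
state2 = np.zeros((2, 1, 640), dtype=np.float32)
last_token, hypothesis = blank_id, []

t, T = 0, encoder_out.shape[2]                   # NeMo encoders emit (batch, dim, time)
while t < T:
    # Prediction network: condition on the previously emitted token.
    # Assumed output order: (outputs, prednet_lengths, states_1, states_2).
    dec_out, _, s1, s2 = decoder_session.run(None, {
        "targets": np.array([[last_token]], dtype=np.int32),
        "target_length": np.array([1], dtype=np.int32),
        "input_states_1": state1,
        "input_states_2": state2,
    })
    # Joint network: one encoder frame + decoder output -> logits
    logits = joiner_session.run(None, {
        "encoder_outputs": encoder_out[:, :, t:t + 1],
        "decoder_outputs": dec_out,
    })[0].squeeze()

    # TDT joiners emit vocab_size+1 token logits followed by duration logits
    token = int(np.argmax(logits[:blank_id + 1]))
    skip = durations[int(np.argmax(logits[blank_id + 1:]))]
    if token == blank_id:
        t += max(skip, 1)             # blank must advance at least one frame
    else:
        hypothesis.append(token)
        last_token = token
        state1, state2 = s1, s2       # commit decoder state only on emission
        t += skip                     # TDT emissions can also jump ahead

text = "".join(vocab[i] for i in hypothesis).replace("▁", " ").strip()
print(text)
```

Production decoders also cap the number of symbols emitted per frame to guarantee forward progress; that guard is omitted here for brevity.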
### Use with Rust

See the Swictation STT implementation for a complete Rust example using these models.
## Verification

These models have been verified to produce identical outputs to the original NeMo export:

- ✅ Correct transcriptions on test audio
- ✅ MD5 checksums match original export
- ✅ Metadata verified (`feat_dim: 80`, `vocab_size: 1024`)
- ✅ Tested with both short (6s) and long (84s) audio
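Both checks are easy to reproduce locally. `get_modelmeta()` and `hashlib` are standard APIs; whether this export stores `feat_dim`/`vocab_size` as custom ONNX metadata is an assumption worth confirming on your copy:

```python
import hashlib
import onnxruntime as ort

# Metadata embedded in the ONNX file (assumes the export wrote these keys)
sess = ort.InferenceSession("models/encoder.int8.onnx")
meta = sess.get_modelmeta().custom_metadata_map
print(meta.get("feat_dim"), meta.get("vocab_size"))

# MD5 of a model file, for comparison against the original export
def md5sum(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

print(md5sum("models/encoder.int8.onnx"))
```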
### Test Results
- Short audio: 100% accuracy - "hello world testing one two three"
- Long audio: 95% accuracy - Complex technical article (375 tokens from 8410 frames)
## Export Details

Export Script: Available in the Swictation repository

Export Method:

```bash
docker run --rm -v $(pwd):/workspace -w /workspace \
  nvcr.io/nvidia/nemo:25.07 \
  bash -c "pip install onnxruntime && python3 export_parakeet_tdt_1.1b.py"
```
## Performance

- WER: 1.39% (LibriSpeech test-clean, from the base model)
- Speed: 64% faster than the RNNT baseline
- Inference: Real-time on modern CPUs, faster on GPU (see the measurement sketch below)
- Blank Rate: 28.6% (excellent for continuous speech)
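To check the real-time claim on your own hardware, a rough real-time-factor measurement needs only the encoder and synthetic features. The 10 ms frame hop used to convert frames to seconds is an assumption from the front-end described under Metadata, and if your copy of the encoder also expects a `length` input, add it to the feed:

```python
import time
import numpy as np

seconds = 30
frames = seconds * 100                 # assumes a 10 ms frame hop
mel = np.random.randn(1, 80, frames).astype(np.float32)

start = time.perf_counter()
encoder_session.run(None, {"audio_signal": mel})
elapsed = time.perf_counter() - start

print(f"RTF: {elapsed / seconds:.3f} (values < 1.0 are faster than real time)")
```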
## Technical Details

### Model Architecture

- Encoder: 42-layer FastConformer (1024-dim)
- Decoder: 2-layer LSTM (640-dim hidden)
- Joiner: Feed-forward network
- Subsampling: 8x (e.g. 8410 mel frames → 8410 / 8 ≈ 1051 encoder frames)
### Metadata

```json
{
  "vocab_size": 1024,
  "normalize_type": "per_feature",
  "pred_rnn_layers": 2,
  "pred_hidden": 640,
  "subsampling_factor": 8,
  "model_type": "EncDecRNNTBPEModel",
  "feat_dim": 80
}
```
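These values pin down the expected front-end: 80 log-mel features with per-feature normalization. Below is a minimal sketch using `librosa`, assuming NeMo's default 25 ms window / 10 ms hop framing; for bit-exact features, use NeMo's own `AudioToMelSpectrogramPreprocessor`:

```python
import librosa
import numpy as np

def mel_features(wav_path):
    """Approximate the mel front-end implied by the metadata above."""
    audio, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=512,
        win_length=400, hop_length=160,   # 25 ms window, 10 ms hop at 16 kHz (assumed)
        n_mels=80,                        # feat_dim from the metadata
    )
    log_mel = np.log(mel + 2**-24)        # zero guard before the log
    # normalize_type = "per_feature": zero mean / unit variance per mel bin
    log_mel = (log_mel - log_mel.mean(axis=1, keepdims=True)) / (
        log_mel.std(axis=1, keepdims=True) + 1e-5
    )
    return log_mel[np.newaxis, ...].astype(np.float32)  # (1, 80, T)
```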
## License
Same as the original NVIDIA Parakeet-TDT 1.1B model (CC-BY-4.0).
## Citation

```bibtex
@misc{parakeet-tdt-1.1b-onnx,
  author = {Robert Lee},
  title = {Parakeet-TDT 1.1B ONNX Export},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/jenerallee78/parakeet-tdt-1.1b-onnx}},
  note = {Verified ONNX export of NVIDIA's Parakeet-TDT 1.1B model}
}
```
Original model:

```bibtex
@misc{nvidia-parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT 1.1B},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/nvidia/parakeet-tdt-1.1b}}
}
```
## Credits

- Base Model: NVIDIA Research
- ONNX Export: Swictation Project
- Verification: Extensive testing with a Rust ONNX Runtime implementation