
Parakeet-TDT 1.1B Testing Notes

Export Status: ✅ SUCCESS

Exported successfully on 2025-11-10 using the NVIDIA NeMo Docker container (25.07).

Integration Status: ⚠️ PARTIAL - Needs Debugging

What Works

  • ✅ Model loads successfully with ONNX Runtime
  • ✅ Encoder processes audio (80 mel features → 1024-dim output)
  • ✅ Decoder and joiner execute without errors
  • ✅ Inference completes in ~545ms on GPU

What Doesn't Work

  • ❌ Transcription produces nonsense output ("mmhmm" instead of real speech)
  • ❌ Decoder outputs mostly blank tokens, with a few seemingly random tokens

Technical Details

Model Requirements

  • Mel Features: 80 (NOT 128 like 0.6B model)
  • Encoder Input: [batch, 80, time]
  • Encoder Output: [batch, 1024, time]
  • Decoder: 2 RNN layers, 640 hidden dimension
  • Vocabulary: 1025 tokens including <blk> (blank)
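The requirements above can be pinned down in code so that a 0.6B-style input (128 mel bins) fails fast instead of producing garbage. This is a minimal sketch: the `ModelDims` struct, the `PARAKEET_TDT_1_1B` constant, and `check_encoder_input` are illustrative names, not part of the existing Rust crate.

```rust
/// Dimensions copied from the notes above. The struct itself is illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ModelDims {
    pub n_mels: usize,
    pub encoder_dim: usize,
    pub decoder_layers: usize,
    pub decoder_hidden: usize,
    pub vocab_size: usize, // includes <blk>
}

pub const PARAKEET_TDT_1_1B: ModelDims = ModelDims {
    n_mels: 80,
    encoder_dim: 1024,
    decoder_layers: 2,
    decoder_hidden: 640,
    vocab_size: 1025,
};

/// Checks that a feature tensor shape is [batch, n_mels, time].
pub fn check_encoder_input(shape: &[usize], dims: &ModelDims) -> Result<(), String> {
    match shape {
        [batch, mels, time] if *batch > 0 && *time > 0 => {
            if *mels == dims.n_mels {
                Ok(())
            } else {
                Err(format!("expected {} mel bins, got {}", dims.n_mels, mels))
            }
        }
        _ => Err(format!(
            "expected [batch, {}, time], got {:?}",
            dims.n_mels, shape
        )),
    }
}
```

Calling `check_encoder_input(&[1, 128, 300], &PARAKEET_TDT_1_1B)` returns an error naming the mel-bin mismatch, which is the mistake the 128 → 80 fix addressed.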

Files

encoder.int8.onnx    1.1GB  (INT8 quantized)
decoder.int8.onnx    7.0MB  (INT8 quantized)
joiner.int8.onnx     1.7MB  (INT8 quantized)
tokens.txt           11KB   (1025 tokens)

Known Issues

Issue #1: Incorrect Transcription

Symptoms:

  • Input: "en-short.mp3" (normal speech)
  • Output: "mmhmm" (nonsense)
  • Token IDs: [19, 1010, 1005, 1010, 1010]
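To interpret IDs like these, it helps to map them back through tokens.txt and see which pieces were actually emitted. The sketch below assumes the sherpa-onnx-style layout of one `<token> <id>` pair per line and SentencePiece word-start markers (`▁`); neither is confirmed by these notes. The blank ID is passed in by the caller rather than assumed, and the function names are illustrative.

```rust
use std::collections::HashMap;

/// Parses lines of the form "<token> <id>" into an id -> token map.
/// Lines that do not end in a numeric id are skipped.
pub fn parse_tokens(text: &str) -> HashMap<usize, String> {
    let mut map = HashMap::new();
    for line in text.lines() {
        // Split on the last space so a token containing spaces survives.
        if let Some((tok, id)) = line.rsplit_once(' ') {
            if let Ok(id) = id.trim().parse::<usize>() {
                map.insert(id, tok.to_string());
            }
        }
    }
    map
}

/// Joins token pieces into text: skips blank, marks unknown ids,
/// and turns the SentencePiece word-start marker into a space.
pub fn decode_ids(ids: &[usize], table: &HashMap<usize, String>, blank_id: usize) -> String {
    let mut out = String::new();
    for &id in ids {
        if id == blank_id {
            continue;
        }
        match table.get(&id) {
            Some(tok) => out.push_str(tok),
            None => out.push_str("<unk>"),
        }
    }
    out.replace('\u{2581}', " ").trim().to_string()
}
```

Running the five IDs above through `decode_ids` with the real tokens.txt shows whether they form plausible word pieces or unrelated fragments, which helps separate a vocabulary mismatch from a decoding-loop problem.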

Possible Causes:

  1. Export Format Mismatch: NeMo export may produce a slightly different format than expected
  2. Decoder Logic Bug: Our decoder/joiner implementation may not match 1.1B requirements
  3. Audio Preprocessing: Subtle differences in mel-spectrogram computation
  4. Model-Specific Parameters: 1.1B may have different expectations than 0.6B
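For cause 2, one thing worth ruling out is whether the decode loop treats the joiner output as TDT (token logits followed by duration logits) rather than plain RNN-T. The sketch below is a generic greedy TDT loop, not our current implementation. It rests on several unverified assumptions: that the joiner emits `vocab_size` token logits followed by one logit per duration, that the duration set is something like `[0, 1, 2, 3, 4]` (to be read from the model config), and that the blank ID is supplied by the caller. The `joint` closure is a hypothetical stand-in for running the decoder and joiner on frame `t` given the tokens emitted so far.

```rust
/// Index of the largest value; the first maximum wins on ties.
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Greedy TDT decoding over `num_frames` encoder frames.
/// `joint(t, tokens)` must return vocab_size + durations.len() logits.
pub fn tdt_greedy_decode<F>(
    num_frames: usize,
    vocab_size: usize, // includes blank
    blank_id: usize,
    durations: &[usize],
    max_symbols_per_frame: usize,
    mut joint: F,
) -> Vec<usize>
where
    F: FnMut(usize, &[usize]) -> Vec<f32>,
{
    assert!(!durations.is_empty(), "need at least one duration");
    let mut tokens = Vec::new();
    let mut t = 0;
    let mut emitted_here = 0; // symbols emitted without advancing t
    while t < num_frames {
        let logits = joint(t, &tokens);
        assert_eq!(logits.len(), vocab_size + durations.len());
        // Token and duration are picked from separate slices of the output.
        let token = argmax(&logits[..vocab_size]);
        let mut skip = durations[argmax(&logits[vocab_size..])];
        if token != blank_id {
            tokens.push(token);
            emitted_here += 1;
        }
        // A blank with duration 0 would loop forever, so force progress.
        if token == blank_id && skip == 0 {
            skip = 1;
        }
        // Cap the number of symbols emitted on a single frame.
        if skip == 0 && emitted_here >= max_symbols_per_frame {
            skip = 1;
        }
        if skip > 0 {
            t += skip;
            emitted_here = 0;
        }
    }
    tokens
}
```

If a loop instead takes the argmax over the whole joiner output, or advances exactly one frame per step, the duration logits get misread and the output can degrade to mostly blanks with a few stray tokens. Whether that is what happens here is unconfirmed.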

Debug Attempts:

  • ✅ Fixed mel feature count (128 → 80)
  • ✅ Verified encoder input/output shapes
  • ✅ Confirmed vocabulary format
  • ❌ Cannot verify with sherpa-onnx CLI (installation issues)

Comparison: 0.6B vs 1.1B

Feature              0.6B     1.1B
Mel Features         128      80
Parameters           600M     1.1B
Encoder Output Dim   128      1024
Works in Rust?       ✅ Yes   ⚠️ Partially

Next Steps

Short Term (Recommended)

  1. Use 0.6B model for production (proven to work)
  2. Document 1.1B as "experimental"
  3. Create test suite to verify correct transcription
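For step 3, a test suite needs a numeric pass/fail criterion rather than eyeballing the output. Word error rate against a known reference transcript is the usual choice. The function below is a minimal sketch using word-level edit distance; the function name is illustrative, and the reference text for en-short.mp3 is not recorded in these notes, so it would have to be supplied.

```rust
/// Word error rate: word-level Levenshtein distance divided by the
/// number of reference words. Returns 0.0 when both inputs are empty
/// and 1.0 when only the reference is empty.
pub fn word_error_rate(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        return if h.is_empty() { 0.0 } else { 1.0 };
    }
    // prev[j] holds the edit distance between r[..i-1] and h[..j].
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for i in 1..=r.len() {
        // cur[0] = i deletions; the other entries are overwritten below.
        let mut cur = vec![i; h.len() + 1];
        for j in 1..=h.len() {
            let substitution = prev[j - 1] + usize::from(r[i - 1] != h[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = cur[j - 1] + 1;
            cur[j] = substitution.min(deletion).min(insertion);
        }
        prev = cur;
    }
    prev[h.len()] as f64 / r.len() as f64
}
```

A regression test could then assert that the WER on a short clip stays below a chosen threshold. An output like "mmhmm" against any multi-word reference scores 1.0 and fails immediately.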

Long Term

  1. Verify export with sherpa-onnx reference implementation
  2. Compare our preprocessing with official NeMo preprocessing
  3. Debug decoder/joiner logic with verbose logging
  4. Consider using parakeet-rs if it supports 1.1B directly
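For step 2, the comparison can be reduced to a single number: the largest element-wise difference between our mel features and a reference dump for the same audio. The sketch below assumes both buffers are flattened in the same [80, time] layout. How the reference is exported from NeMo and loaded in Rust is left out, and the function name is illustrative.

```rust
/// Returns the index and size of the largest absolute element-wise
/// difference, or None if the buffers are empty or differ in length.
pub fn max_abs_diff(ours: &[f32], reference: &[f32]) -> Option<(usize, f32)> {
    if ours.len() != reference.len() || ours.is_empty() {
        return None;
    }
    let mut worst = (0usize, 0.0f32);
    for (i, (a, b)) in ours.iter().zip(reference).enumerate() {
        let d = (a - b).abs();
        // NaN never compares greater, so report it explicitly.
        if d.is_nan() {
            return Some((i, f32::NAN));
        }
        if d > worst.1 {
            worst = (i, d);
        }
    }
    Some(worst)
}
```

A length mismatch alone already signals a framing difference (window size, hop length, or padding). A large difference at matching lengths points toward scaling, log, or normalization differences.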

Testing Commands

Build Test Binary

cargo build --release --example test_1_1b_direct

Run Test

cargo run --release --example test_1_1b_direct /opt/swictation/examples/en-short.mp3

Expected vs Actual

  • Expected: Real transcription of audio content
  • Actual: "mmhmm" (5 tokens emitted; most frames decode to blank)

References

  • Export script: /opt/swictation/scripts/export_parakeet_tdt_1.1b.py
  • Rust implementation: /opt/swictation/rust-crates/swictation-stt/src/recognizer_ort.rs
  • Audio preprocessing: /opt/swictation/rust-crates/swictation-stt/src/audio.rs
  • Test program: /opt/swictation/rust-crates/swictation-stt/examples/test_1_1b_direct.rs

Conclusion

The 1.1B model export was successful, and it established the model's key requirements (80 mel features, 1024-dim encoder output). However, the Rust integration needs additional debugging before it produces correct transcriptions. The 0.6B model remains the recommended choice for production use until 1.1B is fully validated.


Last updated: 2025-11-10
Status: Blocked on transcription validation