# AHA Moment #20 - Parakeet-TDT 1.1B Export Bug Analysis
**Date:** 2025-11-10
**Status:** ROOT CAUSE IDENTIFIED & FIXED
---
## Problem Statement
sherpa-onnx (v1.12.15) failed to load the exported 1.1B model with error:
```
<blk> is not the last token!
```
The initial hypothesis was that the tokens.txt format was incorrect, but deeper investigation revealed TWO critical metadata bugs in the export script.
---
## Root Cause Analysis
### Bug #1: Incorrect vocab_size Metadata
**What We Did Wrong:**
```python
# export_parakeet_tdt_1.1b.py line 86 (WRONG!)
vocab_size = len(asr_model.joint.vocabulary) + 1 # = 1024 + 1 = 1025
```
**Why It's Wrong:**
- The metadata `vocab_size` should represent vocabulary ONLY (excluding blank token)
- sherpa-onnx adds the blank token internally
- For TDT models, sherpa-onnx validates: `joiner_output_dim - num_durations = vocab_size + 1`
**The Math:**
```
Our model:
- Joint vocabulary: 1024 tokens
- Joiner output: 1030 dimensions
- TDT durations: 5 (0, 1, 2, 3, 4)
sherpa-onnx calculation:
- Expected vocab_size = (1030 - num_durations) - 1
- With num_durations=4: (1030 - 4) - 1 = 1025  ← matches the 1025 we set (wrong)
- With num_durations=5: (1030 - 5) - 1 = 1024  ← the correct value

The issue: sherpa-onnx detected 4 durations instead of 5!
But even with 5, our vocab_size=1025 was still wrong!
```
**Official 0.6B Model for Comparison:**
- vocab_size (metadata): **1024** (NOT 1025!)
- tokens.txt lines: 1025 (1024 vocab + 1 blank)
- joiner output: 1030 (1024 + 1 blank + 5 durations)
**Correct Formula:**
```python
vocab_size = len(asr_model.joint.vocabulary) # Don't add 1!
```
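With that fix, the relationship between the metadata and tokens.txt can be asserted right after export. A minimal sketch, assuming the loaded `asr_model` and the `tokens.txt` written by the export script:

```python
# Consistency check: metadata vocab_size excludes <blk>,
# while tokens.txt carries one extra line for it.
vocab_size = len(asr_model.joint.vocabulary)      # 1024 for this model

with open("tokens.txt", encoding="utf-8") as f:
    num_lines = sum(1 for _ in f)                 # 1025 (vocab + <blk>)

assert num_lines == vocab_size + 1, "tokens.txt must be vocabulary + <blk>"
```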
---
### Bug #2: Incorrect feat_dim Metadata
**What We Did Wrong:**
```python
# export_parakeet_tdt_1.1b.py line 120 (WRONG!)
"feat_dim": 128, # Copied from 0.6B model
```
**Why It's Wrong:**
- The 1.1B model uses **80 mel filterbank features**, not 128
- This was discovered during Rust testing when the encoder expected `[batch, 80, time]` inputs
- We incorrectly assumed it matched the 0.6B model's 128 features
**Official Specs:**
- 0.6B model: 128 mel features
- 1.1B model: **80 mel features** ← critical difference!
**Correct Value:**
```python
"feat_dim": 80, # 1.1B uses 80 mel features!
```
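Rather than copying the value from another model, the mel feature count can be read from the checkpoint's own preprocessor config. A sketch, assuming the NeMo model loaded in the export script exposes the standard `cfg.preprocessor.features` field:

```python
# Pull feat_dim from the model's own config instead of hard-coding it.
feat_dim = int(asr_model.cfg.preprocessor.features)   # 80 for the 1.1B model
print(f"feat_dim from config: {feat_dim}")
```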
---
## Investigation Process
### 1. Initial Error
```
/project/sherpa-onnx/csrc/offline-recognizer-transducer-nemo-impl.h:PostInit:180
<blk> is not the last token!
```
### 2. Token File Analysis
- Checked tokens.txt format: **CORRECT** (1025 lines, <blk> at end)
- Compared with official 0.6B: same format
- tokens.txt was NOT the problem!
### 3. ONNX Model Inspection
```python
import onnx

# Inspect the joiner's output dimensions
model = onnx.load("joiner.onnx")
dims = model.graph.output[0].type.tensor_type.shape.dim
print([d.dim_param or d.dim_value for d in dims])   # ['batch', 'time', 1, 1030]

# But the vocabulary only has 1024 tokens + blank = 1025.
# The 5 extra dimensions are the TDT duration outputs!
```
### 4. TDT Model Discovery
- Researched TDT (Token-and-Duration Transducer) architecture
- Found: Model config shows `num_extra_outputs: 5`
- Found: Loss config shows `durations: [0, 1, 2, 3, 4]`
- **Insight:** Joiner outputs BOTH tokens AND durations!
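A minimal sketch of how one step of those joiner outputs decomposes, assuming NeMo's TDT layout (token logits plus blank first, duration logits appended last) and dummy values:

```python
import numpy as np

logits = np.random.randn(1030).astype(np.float32)  # one joiner step (dummy values)

vocab_size = 1024       # tokens, excluding blank
num_durations = 5       # durations [0, 1, 2, 3, 4] from the loss config

token_logits = logits[: vocab_size + 1]       # first 1025 = vocab + <blk>
duration_logits = logits[vocab_size + 1 :]    # last 5 = duration outputs

token = int(np.argmax(token_logits))          # which symbol (or blank) to emit
skip = int(np.argmax(duration_logits))        # how many frames to advance (0-4)
```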
### 5. sherpa-onnx Compatibility Research
- Discovered sherpa-onnx added TDT support in v1.11.5 (May 2025)
- Confirmed TDT support in v1.12.15 (our version)
- Found bug fix for TDT decoding in v1.12.14 (PR #2606)
### 6. Official Model Comparison
Downloaded official 0.6B model metadata:
```
vocab_size: 1024  ← AHA! Not 1025!
feat_dim: 128
```
Our 1.1B metadata (WRONG):
```
vocab_size: 1025  ← should be 1024!
feat_dim: 128     ← should be 80!
```
### 7. Debug Output Analysis
sherpa-onnx debug output revealed:
```
TDT model. vocab_size: 1026, num_durations: 4
```
But our metadata said `vocab_size=1025`; sherpa-onnx added its internal blank (giving 1026) and had only 1030 - 1026 = 4 dimensions left over for durations. **MISMATCH!**
Also: sherpa-onnx detected only 4 durations, but the model has 5!
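The mismatch falls straight out of the arithmetic (a sketch of the derivation, not sherpa-onnx's actual source):

```python
joiner_output_dim = 1030

# With our wrong metadata: sherpa-onnx adds its internal blank (1025 + 1 = 1026)
# and attributes the remaining dimensions to durations.
num_durations = joiner_output_dim - (1025 + 1)   # = 4 -> wrong

# With corrected metadata:
num_durations = joiner_output_dim - (1024 + 1)   # = 5 -> correct
```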
---
## Solution
### Fixed Export Script
**File:** `/opt/swictation/scripts/export_parakeet_tdt_1.1b.py`
**Change 1 (Line 87):**
```python
# BEFORE:
vocab_size = len(asr_model.joint.vocabulary) + 1
# AFTER:
# CRITICAL: vocab_size should NOT include blank token (sherpa-onnx adds it internally)
vocab_size = len(asr_model.joint.vocabulary)
```
**Change 2 (Line 122):**
```python
# BEFORE:
"feat_dim": 128, # Mel filterbank features
# AFTER:
"feat_dim": 80, # CRITICAL: 1.1B uses 80 mel features, not 128!
```
### Re-Export Process
```bash
docker run --rm \
-v /opt/swictation:/workspace \
-w /workspace/models/parakeet-tdt-1.1b \
nvcr.io/nvidia/nemo:25.07 \
bash -c "pip install onnxruntime && python3 /workspace/scripts/export_parakeet_tdt_1.1b.py"
```
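Once the re-export finishes, the corrected metadata can be checked directly on the ONNX file. A sketch, assuming the metadata lives in the encoder's `metadata_props` (as in the official sherpa-onnx NeMo exports) and a hypothetical `encoder.onnx` path:

```python
import onnx

model = onnx.load("encoder.onnx")
meta = {p.key: p.value for p in model.metadata_props}

assert meta.get("vocab_size") == "1024", meta.get("vocab_size")
assert meta.get("feat_dim") == "80", meta.get("feat_dim")
print("metadata OK:", meta)
```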
---
## Comparison: 0.6B vs 1.1B
| Feature | 0.6B | 1.1B |
|---------|------|------|
| Vocabulary Size | 1024 | 1024 |
| Mel Features | 128 | **80** |
| Encoder Output Dim | 128 | **1024** |
| Decoder Hidden | 640 | 640 |
| TDT Durations | 5 (0-4) | 5 (0-4) |
| tokens.txt Lines | 1025 | 1025 |
| Joiner Output | 1030 | 1030 |
| **vocab_size (metadata)** | **1024** | **1024** |
| **feat_dim (metadata)** | **128** | **80** |
---
## Key Learnings
1. **Metadata vs File Format:**
   - `vocab_size` metadata = vocabulary only (no blank)
   - `tokens.txt` file = vocabulary + blank token
   - These are different! (See the tokens.txt sketch after this list.)
2. **TDT Model Architecture:**
   - Joiner outputs = vocab + blank + duration outputs
   - For 1.1B: 1024 + 1 + 5 = 1030 total outputs
   - Duration outputs (0-4) allow frame skipping for faster inference
3. **sherpa-onnx TDT Support:**
   - Requires v1.11.5+ for TDT models
   - Validates: `joiner_output_dim - num_durations = vocab_size + 1`
   - Internally shifts the blank token to position 0 (NeMo has it at the end)
4. **Model-Specific Parameters:**
   - Don't assume all models from the same family share parameters!
   - 0.6B uses 128 mel features; 1.1B uses 80
   - Always check the model config for exact specifications
5. **Validation Strategy:**
   - Use a reference implementation (sherpa-onnx) to validate exports
   - Compare with known-working models (0.6B)
   - Check debug output for mismatch clues
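For learning #1, a sketch of a tokens.txt writer that satisfies sherpa-onnx's `<blk>`-last check; `vocabulary` stands in for the 1024-entry `asr_model.joint.vocabulary` list:

```python
# sherpa-onnx tokens.txt format: "<symbol> <id>" per line, with <blk> last.
with open("tokens.txt", "w", encoding="utf-8") as f:
    for i, tok in enumerate(vocabulary):      # lines 1-1024: vocabulary only
        f.write(f"{tok} {i}\n")
    f.write(f"<blk> {len(vocabulary)}\n")     # line 1025: blank token at the end
```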
---
## Next Steps
1. **Validation:** Test re-exported model with sherpa-onnx validation script
2. **Rust Integration:** Test with Rust recognizer once validation passes
3. **Documentation:** Update README with correct metadata requirements
4. **Testing:** Add automated tests to catch metadata bugs in future exports (see the sketch below)
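For step 4, a possible starting point (a sketch; the file path and the `metadata_props` location are assumptions, as above):

```python
import onnx
import pytest

EXPECTED = {"vocab_size": "1024", "feat_dim": "80"}

@pytest.mark.parametrize("key,value", list(EXPECTED.items()))
def test_export_metadata(key, value):
    model = onnx.load("models/parakeet-tdt-1.1b/encoder.onnx")
    meta = {p.key: p.value for p in model.metadata_props}
    assert meta.get(key) == value
```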
---
## References
- sherpa-onnx TDT support: https://github.com/k2-fsa/sherpa-onnx/issues/2183
- Official 0.6B model: https://huggingface.co/csukuangfj/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8
- NeMo 1.1B model: https://huggingface.co/nvidia/parakeet-tdt-1.1b
- TDT paper: Token-and-Duration Transducer for fast inference
---
**Status:** Re-exporting with corrected metadata...