# 🎯 AHA Moment #20 - Parakeet-TDT 1.1B Export Bug Analysis

**Date:** 2025-11-10
**Status:** βœ… ROOT CAUSE IDENTIFIED & FIXED

---

## πŸ” Problem Statement

sherpa-onnx (v1.12.15) failed to load the exported 1.1B model with error:
```
<blk> is not the last token!
```

The initial hypothesis was that the tokens.txt format was incorrect, but deeper investigation revealed TWO critical metadata bugs in the export script.

---

## 🎯 Root Cause Analysis

### Bug #1: Incorrect vocab_size Metadata

**What We Did Wrong:**
```python
# export_parakeet_tdt_1.1b.py line 86 (WRONG!)
vocab_size = len(asr_model.joint.vocabulary) + 1  # = 1024 + 1 = 1025
```

**Why It's Wrong:**
- The `vocab_size` metadata should count the vocabulary ONLY (excluding the blank token)
- sherpa-onnx adds the blank token internally
- For TDT models, sherpa-onnx validates: `joiner_output_dim - num_durations = vocab_size + 1`

**The Math:**
```
Our model:
- Joint vocabulary: 1024 tokens
- Joiner output: 1030 dimensions
- TDT durations: 5 (0, 1, 2, 3, 4)

sherpa-onnx works backwards from the metadata:
- num_durations = joiner_output_dim - (vocab_size + 1)
- With our vocab_size=1025: 1030 - 1026 = 4 durations ❌
- With the correct vocab_size=1024: 1030 - 1025 = 5 durations βœ…

The issue: sherpa-onnx "detected" 4 durations only because our
vocab_size was off by one. Both symptoms, one bug!
```

**Official 0.6B Model for Comparison:**
- vocab_size (metadata): **1024** (NOT 1025!)
- tokens.txt lines: 1025 (1024 vocab + 1 blank)
- joiner output: 1030 (1024 + 1 blank + 5 durations)

**Correct Formula:**
```python
vocab_size = len(asr_model.joint.vocabulary)  # Don't add 1!
```
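
For reference, here is a minimal sketch of how metadata like this is typically attached to the exported ONNX file via `metadata_props` (the filename, helper, and exact key set are illustrative, not the export script's actual code):

```python
import onnx

def add_metadata(filename: str, meta: dict) -> None:
    """Attach key/value pairs to an ONNX model's metadata_props."""
    model = onnx.load(filename)
    del model.metadata_props[:]  # drop stale entries from earlier exports
    for key, value in meta.items():
        entry = model.metadata_props.add()
        entry.key = key
        entry.value = str(value)  # ONNX metadata values are strings
    onnx.save(model, filename)

add_metadata("encoder.onnx", {  # illustrative path
    "vocab_size": 1024,  # vocabulary only; sherpa-onnx adds <blk> itself
    "feat_dim": 80,      # see Bug #2 below
})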

---

### Bug #2: Incorrect feat_dim Metadata

**What We Did Wrong:**
```python
# export_parakeet_tdt_1.1b.py line 120 (WRONG!)
"feat_dim": 128,  # Copied from 0.6B model
```

**Why It's Wrong:**
- The 1.1B model uses **80 mel filterbank features**, not 128
- This was discovered during Rust testing, when the encoder expected `[batch, 80, time]` inputs
- We incorrectly assumed it matched the 0.6B model's 128 features

**Official Specs:**
- 0.6B model: 128 mel features
- 1.1B model: **80 mel features** ← Critical difference!

**Correct Value:**
```python
"feat_dim": 80,  # 1.1B uses 80 mel features!
```
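
Rather than hard-coding the value, the safer move is to read it from the model's own config at export time. A sketch, assuming the standard NeMo preprocessor config layout:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

# The preprocessor config carries the number of mel bins the encoder expects
feat_dim = asr_model.cfg.preprocessor.features  # 80 for the 1.1B model
```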

---

## πŸ”¬ Investigation Process

### 1. Initial Error
```
/project/sherpa-onnx/csrc/offline-recognizer-transducer-nemo-impl.h:PostInit:180
<blk> is not the last token!
```

### 2. Token File Analysis
- Checked tokens.txt format: **βœ… CORRECT** (1025 lines, <blk> at end)
- Compared with official 0.6B: Same format βœ…
- tokens.txt was NOT the problem!
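
A sanity check along these lines (a sketch; the path is illustrative) confirmed the file itself was fine:

```python
# tokens.txt: one "token id" pair per line, <blk> on the last line
with open("tokens.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

assert len(lines) == 1025, f"expected 1025 lines, got {len(lines)}"
assert lines[-1].split()[0] == "<blk>", "<blk> must be the last token"
```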

### 3. ONNX Model Inspection
```python
import onnx

# Joiner output dimensions (path illustrative)
model = onnx.load("joiner.onnx")
dims = model.graph.output[0].type.tensor_type.shape.dim
print([d.dim_param or d.dim_value for d in dims])
# -> ['batch', 'time', 1, 1030]
#                         ^^^^
# But the vocabulary only has 1024 tokens + blank = 1025
# The 5 "missing" dimensions are the TDT duration outputs!
```

### 4. TDT Model Discovery
- Researched TDT (Token-and-Duration Transducer) architecture
- Found: Model config shows `num_extra_outputs: 5`
- Found: Loss config shows `durations: [0, 1, 2, 3, 4]`
- **Insight:** Joiner outputs BOTH tokens AND durations!
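
Concretely, each TDT decoding step splits the 1030 joiner logits into a token block and a duration block. A simplified numpy sketch, assuming the NeMo layout (1024 tokens, then `<blk>`, then 5 duration logits):

```python
import numpy as np

logits = np.random.randn(1030).astype(np.float32)  # stand-in for one joiner step

VOCAB_WITH_BLANK = 1025  # 1024 tokens + <blk>
token_logits = logits[:VOCAB_WITH_BLANK]
duration_logits = logits[VOCAB_WITH_BLANK:]  # 5 entries for durations 0..4

token = int(np.argmax(token_logits))              # which token (or blank) to emit
frames_to_skip = int(np.argmax(duration_logits))  # how far to advance in time
```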

### 5. sherpa-onnx Compatibility Research
- Discovered sherpa-onnx added TDT support in v1.11.5 (May 2025)
- Confirmed TDT support in v1.12.15 (our version) βœ…
- Found bug fix for TDT decoding in v1.12.14 (PR #2606)

### 6. Official Model Comparison
Downloaded official 0.6B model metadata:
```
vocab_size: 1024  ← AHA! Not 1025!
feat_dim: 128
```

Our 1.1B metadata (WRONG):
```
vocab_size: 1025  ← Should be 1024!
feat_dim: 128     ← Should be 80!
```
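
Metadata like this can be dumped straight from the ONNX files for a side-by-side comparison (path illustrative):

```python
import onnx

model = onnx.load("encoder.onnx")  # run once per model being compared
for prop in model.metadata_props:
    print(f"{prop.key}: {prop.value}")
```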

### 7. Debug Output Analysis
sherpa-onnx debug output revealed:
```
TDT model. vocab_size: 1026, num_durations: 4
```

That `vocab_size: 1026` is our wrong metadata value (1025) plus the blank token sherpa-onnx adds internally. From there, sherpa-onnx derived only 4 durations, even though the model has 5. One off-by-one in the metadata produced both symptoms.
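
A plausible reconstruction of sherpa-onnx's arithmetic, using the validation rule from Bug #1:

```python
joiner_output_dim = 1030
meta_vocab_size = 1025                  # our wrong metadata
vocab_with_blank = meta_vocab_size + 1  # sherpa-onnx adds <blk> -> 1026
num_durations = joiner_output_dim - vocab_with_blank  # 1030 - 1026 = 4, not 5!
```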

---

## βœ… Solution

### Fixed Export Script

**File:** `/opt/swictation/scripts/export_parakeet_tdt_1.1b.py`

**Change 1 (Line 87):**
```python
# BEFORE:
vocab_size = len(asr_model.joint.vocabulary) + 1

# AFTER:
# CRITICAL: vocab_size should NOT include blank token (sherpa-onnx adds it internally)
vocab_size = len(asr_model.joint.vocabulary)
```

**Change 2 (Line 122):**
```python
# BEFORE:
"feat_dim": 128,  # Mel filterbank features

# AFTER:
"feat_dim": 80,  # CRITICAL: 1.1B uses 80 mel features, not 128!
```

### Re-Export Process
```bash
docker run --rm \
  -v /opt/swictation:/workspace \
  -w /workspace/models/parakeet-tdt-1.1b \
  nvcr.io/nvidia/nemo:25.07 \
  bash -c "pip install onnxruntime && python3 /workspace/scripts/export_parakeet_tdt_1.1b.py"
```
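
Once the re-export finishes, the fix can be confirmed by reading the metadata back through onnxruntime (a sketch; path illustrative):

```python
import onnxruntime as ort

sess = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
meta = sess.get_modelmeta().custom_metadata_map  # dict of str -> str

assert meta["vocab_size"] == "1024"
assert meta["feat_dim"] == "80"
```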

---

## πŸ“Š Comparison: 0.6B vs 1.1B

| Feature | 0.6B | 1.1B |
|---------|------|------|
| Vocabulary Size | 1024 | 1024 |
| Mel Features | 128 | **80** |
| Encoder Output Dim | 128 | **1024** |
| Decoder Hidden | 640 | 640 |
| TDT Durations | 5 (0-4) | 5 (0-4) |
| tokens.txt Lines | 1025 | 1025 |
| Joiner Output | 1030 | 1030 |
| **vocab_size (metadata)** | **1024** | **1024** βœ… |
| **feat_dim (metadata)** | **128** | **80** βœ… |

---

## πŸŽ“ Key Learnings

1. **Metadata vs File Format:**
   - `vocab_size` metadata = vocabulary only (no blank)
   - `tokens.txt` file = vocabulary + blank token
   - These are different! (See the tokens.txt sketch after this list.)

2. **TDT Model Architecture:**
   - Joiner outputs = vocab + blank + duration outputs
   - For 1.1B: 1024 + 1 + 5 = 1030 total outputs
   - Duration outputs (0-4) allow frame skipping for faster inference

3. **sherpa-onnx TDT Support:**
   - Requires v1.11.5+ for TDT models
   - Validates: `joiner_output_dim - num_durations = vocab_size + 1`
   - Internally shifts blank token to position 0 (NeMo has it at end)

4. **Model-Specific Parameters:**
   - Don't assume all models from same family have same parameters!
   - 0.6B uses 128 mel features, 1.1B uses 80
   - Always check model config for exact specifications

5. **Validation Strategy:**
   - Use reference implementation (sherpa-onnx) to validate exports
   - Compare with known-working models (0.6B)
   - Check debug output for mismatch clues
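
As referenced in learning 1, a minimal sketch of writing a sherpa-onnx-style `tokens.txt`, with the vocabulary first and `<blk>` as the very last line:

```python
vocabulary = asr_model.joint.vocabulary  # 1024 entries

with open("tokens.txt", "w", encoding="utf-8") as f:
    for i, token in enumerate(vocabulary):
        f.write(f"{token} {i}\n")
    f.write(f"<blk> {len(vocabulary)}\n")  # line 1025: blank token last
```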

---

## βœ… Next Steps

1. **Validation:** Test re-exported model with sherpa-onnx validation script
2. **Rust Integration:** Test with Rust recognizer once validation passes
3. **Documentation:** Update README with correct metadata requirements
4. **Testing:** Add automated tests to catch metadata bugs in future exports

---

## πŸ“ References

- sherpa-onnx TDT support: https://github.com/k2-fsa/sherpa-onnx/issues/2183
- Official 0.6B model: https://huggingface.co/csukuangfj/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8
- NeMo 1.1B model: https://huggingface.co/nvidia/parakeet-tdt-1.1b
- TDT paper: *Efficient Sequence Transduction by Jointly Predicting Tokens and Durations* (arXiv:2304.06795)

---

**Status:** πŸ”„ Re-exporting with corrected metadata...