
How to handle tokenizer.bin in the latest Moonshine ONNX models?

#11 opened by Aleister-clowly

I noticed that both the float and quantized versions of the Moonshine ONNX models now include only a `tokenizer.bin` file, instead of the previous `tokenizer.json`.

When trying to load it in Python (using `tokenizers` or `transformers`), it fails with errors like:

    Cannot instantiate Tokenizer from buffer: expected value at line 1 column 1
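For reference, a minimal sketch of the failing call. The byte payload here is a hypothetical stand-in for the contents of `tokenizer.bin`; any non-JSON bytes fail the same way, since `Tokenizer.from_buffer` expects a JSON-serialized tokenizer definition:

```python
from tokenizers import Tokenizer

# Hypothetical stand-in for the raw bytes of tokenizer.bin; in practice you
# would read the downloaded file: open("tokenizer.bin", "rb").read()
blob = b"\x00\x04\x00\x00binary-not-json"

try:
    Tokenizer.from_buffer(blob)
except Exception as err:
    # from_buffer rejects anything that is not a JSON tokenizer definition
    print("load failed:", err)
```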

So I would like to ask:

1. What format does `tokenizer.bin` use (FlatBuffers / a custom binary format)?
2. Is there an official or recommended way to load it in Python, or to convert it to the standard Hugging Face JSON tokenizer format?
3. If it is meant only for the C++/embedded SDKs, is there any plan to release a Python-compatible tokenizer again?
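In the meantime, the format question can sometimes be narrowed down by inspecting the file header. Below is a stdlib-only sketch; `sniff_format` is a hypothetical helper, and the FlatBuffers check (looking for the optional 4-byte ASCII file identifier at offset 4) is only a heuristic:

```python
def sniff_format(data: bytes) -> str:
    """Heuristically guess the serialization format of a tokenizer blob."""
    # Standard Hugging Face tokenizer.json files are plain JSON text.
    if data[:1] in (b"{", b"["):
        return "JSON (standard Hugging Face tokenizer)"
    # FlatBuffers files may carry a 4-byte ASCII file identifier at offset 4,
    # right after the uint32 root-table offset.
    if len(data) >= 8 and data[4:8].isalnum():
        return f"possible FlatBuffer (identifier {data[4:8]!r})"
    return "unknown binary format"

# Usage (hypothetical path):
# with open("tokenizer.bin", "rb") as f:
#     print(sniff_format(f.read()))
```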

Thanks in advance! I believe this would help many developers trying to run Moonshine locally with ONNX inference.
