## Speech Semantic Tokenizer

As illustrated below, this tokenizer is trained with a supervised objective: the phoneme sequences corresponding to the transcript serve as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of Chinese and English speech-text data sampled from open-source datasets, with a 1:1 ratio between the two languages. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq). The decoder, by contrast, is deliberately simple, consisting of only four CNN layers. We believe this simple, weak decoder is key to training the tokenizer: since it cannot compensate for uninformative inputs, the semantic content must be captured by the discrete tokens themselves.
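To make the layout concrete, here is a minimal PyTorch sketch of the encoder/decoder structure described above. It is an illustration under stated assumptions, not the repository's implementation: the class names, the 1024-dimensional features, the codebook size, and the vector-quantization bottleneck with a straight-through estimator are all assumed, since the text only specifies the HuBERT encoder, the phoneme labels, and the four-layer CNN decoder.

```python
# Minimal sketch of the described architecture. Names, dimensions, and the
# vector-quantization bottleneck are illustrative assumptions.
import torch
import torch.nn as nn

class ShallowCNNDecoder(nn.Module):
    """Deliberately weak decoder: four 1-D CNN layers mapping quantized
    speech features to phoneme logits."""
    def __init__(self, dim: int = 1024, n_phonemes: int = 200):
        super().__init__()
        layers = []
        for _ in range(4):  # four CNN layers, per the description above
            layers += [nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers)
        self.proj = nn.Linear(dim, n_phonemes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> phoneme logits (batch, time, n_phonemes)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h)

class SemanticTokenizer(nn.Module):
    """Pretrained speech encoder (e.g. hubert-large) -> assumed codebook
    lookup -> weak CNN decoder supervised with G2P phoneme labels."""
    def __init__(self, encoder: nn.Module, dim: int = 1024,
                 codebook_size: int = 1024, n_phonemes: int = 200):
        super().__init__()
        self.encoder = encoder                      # returns (batch, time, dim)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = ShallowCNNDecoder(dim, n_phonemes)

    def tokenize(self, waveform: torch.Tensor) -> torch.Tensor:
        # A discrete semantic token is the index of the nearest codebook entry.
        feats = self.encoder(waveform)                                  # (B, T, D)
        dists = torch.cdist(feats, self.codebook.weight.unsqueeze(0))   # (B, T, K)
        return dists.argmin(dim=-1)                                     # (B, T)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Re-encodes for brevity; a real implementation would share features.
        feats = self.encoder(waveform)
        quantized = self.codebook(self.tokenize(waveform))
        # Straight-through estimator so gradients reach the encoder.
        quantized = feats + (quantized - feats).detach()
        return self.decoder(quantized)  # train with CE against phoneme labels
```

The weak-decoder principle shows up here as the small receptive field and capacity of `ShallowCNNDecoder` relative to the HuBERT encoder.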

To run this semantic tokenizer alone, install the required packages first:

```bash
# install requirements for this semantic tokenizer on Ascend 910B
# for GPUs, just remove torch-npu==2.5.1 from the requirements
pip install -r requirements_npu.txt
```
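Once the requirements are installed, the sketch above could be exercised end to end as follows. The Hugging Face `facebook/hubert-large-ll60k` checkpoint and the file path are stand-ins for the repository's own encoder and assets, used purely to make the example self-contained.

```python
# Usage sketch continuing the SemanticTokenizer example above.
# "facebook/hubert-large-ll60k" stands in for the repository's own
# hubert-large checkpoint; paths and wrapper are illustrative only.
import torch
import torchaudio
from transformers import HubertModel

class HubertEncoder(torch.nn.Module):
    """Thin wrapper exposing HuBERT hidden states as (batch, time, dim)."""
    def __init__(self):
        super().__init__()
        self.hubert = HubertModel.from_pretrained("facebook/hubert-large-ll60k")

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.hubert(waveform).last_hidden_state

waveform, sr = torchaudio.load("example.wav")  # assumes 16 kHz mono input
tokenizer = SemanticTokenizer(encoder=HubertEncoder())
with torch.no_grad():
    tokens = tokenizer.tokenize(waveform)
print(tokens.shape)  # (1, num_frames): one discrete token per ~20 ms frame
```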