## Speech Semantic Tokenizer

As illustrated below, this tokenizer is trained with a supervised objective: the phoneme sequences corresponding to the transcript serve as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of Chinese and English speech-text data sampled from open-source datasets, with a 1:1 ratio between the two languages. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq). The decoder, by contrast, is deliberately simple, consisting of only four CNN layers. We believe this simple, weak decoder is key to training the tokenizer: since it cannot compensate for uninformative inputs, the semantic content must be captured by the discrete tokens themselves.
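To make the layout concrete, here is a minimal PyTorch sketch of the encoder/decoder structure described above. It is an illustration under stated assumptions, not the repository's implementation: the class names, the 1024-dimensional features, the codebook size, and the vector-quantization bottleneck with a straight-through estimator are all assumed, since the text only specifies the HuBERT encoder, the phoneme labels, and the four-layer CNN decoder.

```python
# Minimal sketch of the described architecture. Names, dimensions, and the
# vector-quantization bottleneck are illustrative assumptions.
import torch
import torch.nn as nn

class ShallowCNNDecoder(nn.Module):
    """Deliberately weak decoder: four 1-D CNN layers mapping quantized
    speech features to phoneme logits."""
    def __init__(self, dim: int = 1024, n_phonemes: int = 200):
        super().__init__()
        layers = []
        for _ in range(4):  # four CNN layers, per the description above
            layers += [nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers)
        self.proj = nn.Linear(dim, n_phonemes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> phoneme logits (batch, time, n_phonemes)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h)

class SemanticTokenizer(nn.Module):
    """Pretrained speech encoder (e.g. hubert-large) -> assumed codebook
    lookup -> weak CNN decoder supervised with G2P phoneme labels."""
    def __init__(self, encoder: nn.Module, dim: int = 1024,
                 codebook_size: int = 1024, n_phonemes: int = 200):
        super().__init__()
        self.encoder = encoder                      # returns (batch, time, dim)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = ShallowCNNDecoder(dim, n_phonemes)

    def tokenize(self, waveform: torch.Tensor) -> torch.Tensor:
        # A discrete semantic token is the index of the nearest codebook entry.
        feats = self.encoder(waveform)                                  # (B, T, D)
        dists = torch.cdist(feats, self.codebook.weight.unsqueeze(0))   # (B, T, K)
        return dists.argmin(dim=-1)                                     # (B, T)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Re-encodes for brevity; a real implementation would share features.
        feats = self.encoder(waveform)
        quantized = self.codebook(self.tokenize(waveform))
        # Straight-through estimator so gradients reach the encoder.
        quantized = feats + (quantized - feats).detach()
        return self.decoder(quantized)  # train with CE against phoneme labels
```

The weak-decoder principle shows up here as the small receptive field and capacity of `ShallowCNNDecoder` relative to the HuBERT encoder.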

To run this semantic tokenizer alone, install the required packages first:

```bash
# install requirements for this semantic tokenizer on Ascend 910B
# for GPUs, just remove torch-npu==2.5.1 from the requirements
pip install -r requirements_npu.txt
```
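Once the requirements are installed, the sketch above could be exercised end to end as follows. The Hugging Face `facebook/hubert-large-ll60k` checkpoint and the file path are stand-ins for the repository's own encoder and assets, used purely to make the example self-contained.

```python
# Usage sketch continuing the SemanticTokenizer example above.
# "facebook/hubert-large-ll60k" stands in for the repository's own
# hubert-large checkpoint; paths and wrapper are illustrative only.
import torch
import torchaudio
from transformers import HubertModel

class HubertEncoder(torch.nn.Module):
    """Thin wrapper exposing HuBERT hidden states as (batch, time, dim)."""
    def __init__(self):
        super().__init__()
        self.hubert = HubertModel.from_pretrained("facebook/hubert-large-ll60k")

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.hubert(waveform).last_hidden_state

waveform, sr = torchaudio.load("example.wav")  # assumes 16 kHz mono input
tokenizer = SemanticTokenizer(encoder=HubertEncoder())
with torch.no_grad():
    tokens = tokenizer.tokenize(waveform)
print(tokens.shape)  # (1, num_frames): one discrete token per ~20 ms frame
```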