Update README.md
README.md
CHANGED
@@ -4,6 +4,8 @@ license: cc-by-4.0
tags:
- pytorch
- NeMo
---

# Typhoon-asr-realtime

@@ -14,183 +16,92 @@ img {
}
</style>

### NOTE

Add some information about how to use the model here. An example is provided for ASR inference below.

### Transcribing using Python

First, let's get a sample:

```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```

Then simply do:

```python
asr_model.transcribe(['2086-149220-0033.wav'])
```

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="scb10x/typhoon-asr-realtime" audio_dir=""
```

### Input

**Add some information about what the inputs to this model are**

### Output

**Add some information about what the outputs of this model are**

**Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**

datasets:
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- vctk
- voxpopuli
- europarl
- multilingual_librispeech
- mozilla-foundation/common_voice_8_0
- MLCommons/peoples_speech

The corresponding text in this section for those datasets is stated below:

The model was trained on 64K hours of English speech collected and prepared by the NVIDIA NeMo and Suno teams.

The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:

- Librispeech 960 hours of English speech
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN) - 2,000 hour subset
- Mozilla Common Voice (v7.0)
- People's Speech - 12,000 hour subset

## Performance

- name: PUT_MODEL_NAME
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 17.10
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 14.11

## License

License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

## References

**Provide appropriate references in the markdown link format below. Please order them numerically.**

[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

tags:
- pytorch
- NeMo
base_model:
- nvidia/stt_en_fastconformer_transducer_large
---

# Typhoon-asr-realtime

}
</style>

[](#model-architecture)
[](#model-architecture)
[](#datasets)

Typhoon ASR Real-Time is a next-generation, open-source Automatic Speech Recognition (ASR) model built specifically for real-world streaming applications in the Thai language. It is designed to deliver fast and accurate transcriptions while running efficiently on standard CPUs. This enables users to host their own ASR service without requiring expensive, specialized hardware or relying on third-party cloud services for sensitive data.

The model is based on [NVIDIA's FastConformer Transducer model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer), which is optimized for low-latency, real-time performance.
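
Because the checkpoint is packaged for NeMo, it can also be loaded directly with the NeMo toolkit. The snippet below is a minimal sketch rather than an official example: it assumes `nemo_toolkit[asr]` is installed and that the Hugging Face model ID `scb10x/typhoon-asr-realtime` resolves through `from_pretrained`; the `typhoon-asr` package described below is the recommended path.

```python
# Minimal sketch (see assumptions above): load the checkpoint with NeMo and
# transcribe a single file. FastConformer models generally expect 16 kHz mono audio.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="scb10x/typhoon-asr-realtime")
outputs = asr_model.transcribe(["path/to/your_audio.wav"])
print(outputs[0])
```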

**Try our demo available on [Demo]()**

**Code / Examples available on [Github](https://github.com/scb-10x/typhoon-asr)**

**Release Blog available on [OpenTyphoon Blog](https://opentyphoon.ai/blog/en/typhoon-asr-realtime-release)**

***

### Performance

***

### Usage and Implementation

**(Recommended): Quick Start with Google Colab**

For a hands-on demonstration without any local setup, you can run this project directly in Google Colab. The notebook provides a complete environment to transcribe audio files and experiment with the model.

[Open in Colab](https://colab.research.google.com/drive/1t4tlRTJToYRolTmiN5ZWDR67ymdRnpAz?usp=sharing)

**(Recommended): Using the `typhoon-asr` Package**

This is the easiest way to get started. You can install the package via pip and use it directly from the command line or within your Python code.

**1. Install the package:**

```bash
pip install typhoon-asr
```

**2. Command-Line Usage:**

```bash
# Basic transcription (auto-detects device)
typhoon-asr path/to/your_audio.wav

# Transcription with timestamps on a specific device
typhoon-asr path/to/your_audio.mp3 --with-timestamps --device cuda
```

**3. Python API Usage:**

```python
from typhoon_asr import transcribe

# Basic transcription
result = transcribe("path/to/your_audio.wav")
print(result['text'])

# Transcription with timestamps
result_with_timestamps = transcribe("path/to/your_audio.wav", with_timestamps=True)
print(result_with_timestamps)
```
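
For a folder of recordings, the same `transcribe()` call can simply be looped over each file. This is a small sketch, assuming a local `audio/` directory of WAV files and using only the API shown above:

```python
# Sketch: batch-transcribe every WAV file in a folder by reusing transcribe().
# The "audio" directory name is only an example.
from pathlib import Path

from typhoon_asr import transcribe

for wav_path in sorted(Path("audio").glob("*.wav")):
    result = transcribe(str(wav_path))
    print(f"{wav_path.name}: {result['text']}")
```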

**(Alternative): Running from the Repository Script**

You can also run the model by cloning the repository and using the inference script directly. This method is useful for development or if you need to modify the underlying code.

**1. Clone the repository and install dependencies:**

```bash
git clone https://github.com/scb10x/typhoon-asr.git
cd typhoon-asr
pip install -r requirements.txt
```

**2. Run the inference script:**

The `typhoon_asr_inference.py` script handles audio resampling and processing automatically.

```bash
# Basic Transcription (CPU):
python typhoon_asr_inference.py path/to/your_audio.m4a

# Transcription with Estimated Timestamps:
python typhoon_asr_inference.py path/to/your_audio.wav --with-timestamps

# Transcription on a GPU:
python typhoon_asr_inference.py path/to/your_audio.mp3 --device cuda
```

## License

License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.