# X-Codec (speech, WavLM)

This codec is intended for speech data. It is part of the X-Codec family of checkpoints shown below:

| Model checkpoint | Semantic Model | Domain | Training Data |
|------------------|----------------|--------|---------------|
| [xcodec-hubert-librispeech](https://huggingface.co/hf-audio/xcodec-hubert-librispeech) | [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) | Speech | LibriSpeech |
| [xcodec-wavlm-mls](https://huggingface.co/hf-audio/xcodec-wavlm-mls) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus) | Speech | MLS English |
| [xcodec-wavlm-more-data](https://huggingface.co/hf-audio/xcodec-wavlm-more-data) (this model) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus) | Speech | MLS English + internal data |
| [xcodec-hubert-general](https://huggingface.co/hf-audio/xcodec-hubert-general) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | 200k hours of internal data |
| [xcodec-hubert-general-balanced](https://huggingface.co/hf-audio/xcodec-hubert-general-balanced) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | More balanced data |

The original model is `xcodec_wavlm_more_data` from [this table](https://github.com/zhenye234/xcodec?tab=readme-ov-file#available-models).

## Example usage

The example below runs the codec at each supported bandwidth and writes the reconstructed audio to disk.

```python
import os

import torch
from datasets import Audio, load_dataset
from scipy.io.wavfile import write as write_wav
from transformers import AutoFeatureExtractor, XcodecModel

model_id = "hf-audio/xcodec-wavlm-more-data"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
available_bandwidths = [0.5, 1, 1.5, 2, 4]  # kbps

# load model and feature extractor
model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# load an audio example and resample it to the codec's sampling rate
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
librispeech_dummy = librispeech_dummy.cast_column(
    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
)
audio_array = librispeech_dummy[0]["audio"]["array"]
inputs = feature_extractor(
    raw_audio=audio_array, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
).to(model.device)
audio = inputs["input_values"]

for bandwidth in available_bandwidths:
    print(f"Encoding with bandwidth: {bandwidth} kbps")

    # encode
    audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
    print("Codebook shape", audio_codes.shape)
    # 0.5 kbps -> torch.Size([1, 1, 293])
    # 1.0 kbps -> torch.Size([1, 2, 293])
    # 1.5 kbps -> torch.Size([1, 3, 293])
    # 2.0 kbps -> torch.Size([1, 4, 293])
    # 4.0 kbps -> torch.Size([1, 8, 293])

    # decode
    input_values_dec = model.decode(audio_codes).audio_values

    # save reconstructed audio to file
    write_wav(
        f"{os.path.basename(model_id)}_{bandwidth}.wav",
        feature_extractor.sampling_rate,
        input_values_dec.squeeze().detach().cpu().numpy(),
    )

write_wav("original.wav", feature_extractor.sampling_rate, audio.squeeze().detach().cpu().numpy())
```
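
The printed shapes show one additional residual codebook per extra 0.5 kbps. A minimal sketch of that bookkeeping, assuming 1024-entry codebooks (10 bits per code) and the 50 Hz frame rate these numbers imply:

```python
# Assumptions: 1024-entry codebooks (10 bits per code) and a 50 Hz frame
# rate, both inferred from the bandwidths and shapes printed above.
frame_rate = 50          # codec frames per second (assumed)
bits_per_codebook = 10   # log2(1024), assumed codebook size
for num_codebooks in [1, 2, 3, 4, 8]:
    kbps = num_codebooks * frame_rate * bits_per_codebook / 1000
    print(f"{num_codebooks} codebook(s) -> {kbps:g} kbps")
# 1 codebook(s) -> 0.5 kbps ... 8 codebook(s) -> 4 kbps
```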

### 🔊 Audio Samples

**Original**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/original.wav" type="audio/wav">
</audio>

**0.5 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-wavlm-more-data_0.5.wav" type="audio/wav">
</audio>

**1 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-wavlm-more-data_1.wav" type="audio/wav">
</audio>

**1.5 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-wavlm-more-data_1.5.wav" type="audio/wav">
</audio>

**2 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-wavlm-more-data_2.wav" type="audio/wav">
</audio>

**4 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-wavlm-more-data_4.wav" type="audio/wav">
</audio>
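
Beyond listening, a rough numeric sanity check is possible. A minimal sketch, assuming the files written by the script above are in the working directory (SNR is a crude proxy for codec quality, not a perceptual metric):

```python
import numpy as np
from scipy.io.wavfile import read as read_wav

# Crude SNR between the original and the 4 kbps reconstruction, using the
# file names produced by the script above.
_, ref = read_wav("original.wav")
_, dec = read_wav("xcodec-wavlm-more-data_4.wav")
n = min(len(ref), len(dec))  # codec output can differ slightly in length
ref = ref[:n].astype(np.float64)
dec = dec[:n].astype(np.float64)
snr_db = 10 * np.log10(np.sum(ref**2) / np.sum((ref - dec) ** 2))
print(f"SNR at 4 kbps: {snr_db:.1f} dB")
```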

## Batch example

The example below encodes and decodes two audio samples of different lengths as a single padded batch.

```python
import torch
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, XcodecModel

model_id = "hf-audio/xcodec-wavlm-more-data"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
bandwidth = 4  # kbps
n_audio = 2  # number of audio samples to process in a batch

# load model and feature extractor
model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# load audio examples and resample them to the codec's sampling rate
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column(
    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
)
audio = [audio_sample["array"] for audio_sample in ds[-n_audio:]["audio"]]
print(f"Input audio shape: {[_sample.shape for _sample in audio]}")
# Input audio shape: [(113840,), (71680,)]

# pad to a common length and batch
inputs = feature_extractor(
    raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
).to(model.device)
audio = inputs["input_values"]
print(f"Padded audio shape: {audio.shape}")
# Padded audio shape: torch.Size([2, 1, 113920])

# encode
audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
print("Codebook shape", audio_codes.shape)
# Codebook shape torch.Size([2, 8, 356])

# decode
decoded_audio = model.decode(audio_codes).audio_values
print("Decoded audio shape", decoded_audio.shape)
# Decoded audio shape torch.Size([2, 1, 113920])
```
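
Because the shorter sample was padded before encoding, the decoded batch is padded as well. A minimal sketch of trimming the output back to the input lengths printed above:

```python
# Trim the padded decoder output back to the original lengths printed
# above (113840 and 71680 samples).
original_lengths = [113840, 71680]
trimmed = [decoded_audio[i, 0, :length] for i, length in enumerate(original_lengths)]
print([t.shape for t in trimmed])
# [torch.Size([113840]), torch.Size([71680])]
```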