---
license: mit
datasets:
- google/fleurs
- facebook/voxpopuli
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_13_0
- mozilla-foundation/common_voice_17_0
language:
- fr
- en
metrics:
- wer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- speech-recognition
- whisper
- french
- stt
- multilingual
- research
- gilbert
---
|
|
|
|
|
# Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition |
|
|
|
|
|
`Gilbert-FR-Source` is a French automatic speech recognition (ASR) model used as the **research foundation** for the Gilbert project. |
|
|
It is designed as an internal scientific baseline enabling controlled experimentation, reproducible evaluation, and rigorous comparison across ASR architectures, datasets, and adaptation methods. |
|
|
|
|
|
This model is not a fine-tuned derivative, but a **curated research anchor** used to support systematic studies in: |
|
|
|
|
|
- domain adaptation, |
|
|
- robustness to spontaneous and long-form speech, |
|
|
- accented and low-resource linguistic profiles, |
|
|
- telephony and bandwidth-constrained speech, |
|
|
- multi-speaker and meeting transcription. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1. Research Motivation |
|
|
|
|
|
The Gilbert project aims to build highly specialized ASR systems optimized for: |
|
|
|
|
|
- professional meeting transcription (hybrid/remote), |
|
|
- long-form multi-speaker discourse, |
|
|
- institutional environments (education, public sector), |
|
|
- constrained audio conditions (telephony, VoIP, low SNR), |
|
|
- sociolinguistic diversity (African, Canadian, Belgian and other French accents). |
|
|
|
|
|
While Whisper Large V3 provides strong baseline performance, its behavior under domain shifts (spontaneous interactions, overlapping speech, degraded microphones) requires systematic study. |
|
|
`Gilbert-FR-Source` provides the **frozen starting point** for this line of research, ensuring controlled comparisons between experiments. |
|
|
|
|
|
--- |
|
|
|
|
|
## 2. Scientific Goals and Research Questions |
|
|
|
|
|
This model is used to answer a series of research questions: |
|
|
|
|
|
### **Q1. Long-form modeling** |
|
|
How does Whisper Large V3 behave on meetings lasting 30–120 minutes, with natural topic shifts, interruptions, and pragmatic markers?
|
|
|
|
|
### **Q2. Accent robustness** |
|
|
Which classes of French accents induce the strongest WER degradation? |
|
|
How does robustness vary across FLEURS, African French, and Common Voice subsets? |
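For instance, a minimal sketch of loading the relevant French evaluation splits with Hugging Face Datasets (the config names are the public ones on the Hub; Common Voice is gated, so a Hub token with accepted terms is assumed):

```python
from datasets import load_dataset

# French FLEURS test split (Hub config name "fr_fr").
fleurs_fr = load_dataset("google/fleurs", "fr_fr", split="test", streaming=True)

# French Common Voice 17.0 test split; the dataset is gated, so a Hub
# token with accepted terms is assumed to be configured.
cv_fr = load_dataset(
    "mozilla-foundation/common_voice_17_0", "fr", split="test", streaming=True
)

# Streaming avoids a full download when probing accent subsets.
sample = next(iter(fleurs_fr))
print(sample["transcription"])  # reference text for WER scoring
```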
|
|
|
|
|
### **Q3. Telephony adaptation** |
|
|
What is the degradation curve when 16 kHz audio is downsampled to 8 kHz or passed through μ-law (G.711-style) compression?
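One way to sketch these conditions (torchaudio is our assumption here, not the project's stated tooling) is to band-limit the signal to 8 kHz and apply 8-bit μ-law companding before resampling back to Whisper's native 16 kHz input:

```python
import torchaudio
import torchaudio.functional as F

# Load a 16 kHz reference clip (the path is illustrative).
wav, sr = torchaudio.load("sample_16k.wav")

# Downsample to the telephony band (8 kHz), then apply 8-bit mu-law
# encoding/decoding to mimic G.711-style companding loss.
wav_8k = F.resample(wav, orig_freq=sr, new_freq=8000)
mulaw = F.mu_law_encoding(wav_8k, quantization_channels=256)
wav_degraded = F.mu_law_decoding(mulaw, quantization_channels=256)

# Whisper consumes 16 kHz input, so upsample back before transcription;
# the band-limiting and quantization artifacts are preserved.
wav_for_asr = F.resample(wav_degraded, orig_freq=8000, new_freq=16000)
```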
|
|
|
|
|
### **Q4. Domain adaptation efficiency** |
|
|
What is the marginal gain of targeted fine-tuning on professional meeting datasets (education, administration, healthcare)? |
|
|
|
|
|
### **Q5. Multilingual side-effects** |
|
|
To what extent does French fine-tuning affect cross-lingual generalization? |
|
|
|
|
|
These research axes structure the development of future specialized Gilbert models. |
|
|
|
|
|
--- |
|
|
|
|
|
## 3. Benchmark Reference Results |
|
|
|
|
|
The following WER scores originate from established open benchmarks and serve as a *reference baseline* for future experiments: |
|
|
|
|
|
| Dataset | WER |
|---------|-----|
| MLS (FR) | 3.98 % |
| Common Voice FR (v13.0) | 7.28 % |
| VoxPopuli (FR) | 8.91 % |
| FLEURS (FR) | 4.84 % |
| African Accented French | 4.20 % |
|
|
|
|
These scores serve as **upper bounds on acceptable WER**: targeted fine-tuning is expected to match or lower them.
|
|
Future Gilbert variants will be evaluated using: |
|
|
|
|
|
- internal meeting datasets, |
|
|
- domain-specific corpora (administration, higher education, healthcare), |
|
|
- accented speech corpora, |
|
|
- telephony datasets, |
|
|
- long-form evaluation methods (> 1 hour audio). |
|
|
|
|
|
--- |
|
|
|
|
|
## 4. Architecture |
|
|
|
|
|
The model is based on the **Whisper Large V3** encoder–decoder architecture, offering: |
|
|
|
|
|
- large multilingual pretraining, |
|
|
- long-context modeling capacity, |
|
|
- robust cross-lingual alignment, |
|
|
- stable decoding for long outputs, |
|
|
- strong zero-shot performance on French. |
|
|
|
|
|
It is compatible with: |
|
|
|
|
|
- Hugging Face Transformers, |
|
|
- CTranslate2, |
|
|
- ONNX Runtime, |
|
|
- MLX (Apple Silicon), |
|
|
- quantization-based acceleration pipelines. |
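As a minimal loading sketch with Transformers (the model id below is the stated base checkpoint; substitute this repository's id once the Gilbert weights are published):

```python
import torch
from transformers import pipeline

# Stated base checkpoint; swap in the Gilbert-FR-Source repo id as needed.
MODEL_ID = "openai/whisper-large-v3"

asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    torch_dtype=torch.float16,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

# Chunked decoding handles recordings longer than Whisper's 30 s window.
result = asr(
    "meeting.wav",  # illustrative path
    chunk_length_s=30,
    generate_kwargs={"language": "french", "task": "transcribe"},
)
print(result["text"])
```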
|
|
|
|
|
--- |
|
|
|
|
|
## 5. Methodology and Reproducibility |
|
|
|
|
|
`Gilbert-FR-Source` is used in strict research settings emphasizing: |
|
|
|
|
|
### **Reproducible training protocols** |
|
|
- frozen weights for baseline comparison, |
|
|
- controlled hyperparameter schedules, |
|
|
- consistent evaluation datasets, |
|
|
- deterministic decoding configurations. |
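A sketch of the determinism pinning this implies (the seed value and helper name are illustrative, not part of the released tooling):

```python
import random

import numpy as np
import torch

def set_determinism(seed: int = 42) -> None:
    """Pin every RNG source so baseline runs are exactly repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Greedy decoding (no sampling, beam size 1) removes generation-time
# stochasticity on top of the fixed seeds.
DETERMINISTIC_GENERATE_KWARGS = {"do_sample": False, "num_beams": 1}
```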
|
|
|
|
|
### **Evaluation methodology** |
|
|
WER is computed with standard normalization (lowercasing, punctuation removal). |
|
|
More advanced metrics (diarization error rate, long-context drift) are included in internal research pipelines. |
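A sketch of this scoring step, assuming the Hugging Face `evaluate` WER metric and Whisper's multilingual `BasicTextNormalizer` (which lowercases and strips punctuation-like symbols):

```python
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = load("wer")
normalizer = BasicTextNormalizer()

references = ["Bonjour à tous, la réunion peut commencer."]
predictions = ["bonjour à tous la réunion peut commencer"]

# Normalize both sides before scoring so casing and punctuation
# differences do not inflate the error rate.
wer = wer_metric.compute(
    references=[normalizer(r) for r in references],
    predictions=[normalizer(p) for p in predictions],
)
print(f"WER: {wer:.2%}")  # 0.00% once normalization removes the mismatches
```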
|
|
|
|
|
### **Versioning policy** |
|
|
This repository represents version `0.1` of the research baseline. |
|
|
All future fine-tuned models will explicitly reference this version for traceability. |
|
|
|
|
|
--- |
|
|
|
|
|
## 6. Limitations |
|
|
|
|
|
This baseline inherits the known limitations of Whisper and of the underlying datasets: |
|
|
|
|
|
- sensitivity to overlapping speech, |
|
|
- occasional hallucinations in long-form decoding, |
|
|
- domain shift on spontaneous dialogue, |
|
|
- potential biases related to accent distribution in training data, |
|
|
- suboptimal performance on telephony-bandwidth (8 kHz) audio.
|
|
|
|
|
Understanding and quantifying these limitations is one of the core objectives of the Gilbert research roadmap. |
|
|
|
|
|
--- |
|
|
|
|
|
## 7. Future Work (Planned Research Directions) |
|
|
|
|
|
The following models will be developed as independent checkpoints: |
|
|
|
|
|
- **Gilbert-FR-Longform-v1** |
|
|
Long meetings, multi-speaker interaction, discourse-level context stability. |
|
|
|
|
|
- **Gilbert-FR-Accents-v1** |
|
|
Robustness to regional and international French accents. |
|
|
|
|
|
- **Gilbert-FR-Telephone-v1** |
|
|
Optimized for 8 kHz VoIP/call-center speech. |
|
|
|
|
|
- **Gilbert-Multilingual-v1** |
|
|
Extended cross-lingual performance with optimized French anchors. |
|
|
|
|
|
Each model will include detailed evaluation reports and will adhere to research reproducibility standards. |
|
|
|
|
|
--- |
|
|
|
|
|
## 8. License |
|
|
|
|
|
This repository is distributed under the MIT License, and a copy of the license is included. Some files were originally released under MIT by their upstream authors.
|
|
|
|
|
All future Gilbert models built on top of this baseline are the exclusive property of Lexia France. |
|
|
|
|
|
--- |
|
|
|
|
|
## 9. Contact |
|
|
|
|
|
For research collaboration, evaluation access, or technical inquiries: |
|
|
|
|
|
- Website: https://gilbert-assistant.fr |
|
|
- Email: [email protected] |
|
|
|