---
license: mit
datasets:
- google/fleurs
- facebook/voxpopuli
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_13_0
- mozilla-foundation/common_voice_17_0
language:
- fr
- en
metrics:
- wer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- speech-recognition
- whisper
- french
- stt
- multilingual
- research
- gilbert
---
|
|
|
|
|
# Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition |
|
|
|
|
|
`Gilbert-FR-Source` is a French automatic speech recognition (ASR) model used as the **research foundation** for the Gilbert project. |
|
|
It is designed as an internal scientific baseline enabling controlled experimentation, reproducible evaluation, and rigorous comparison across ASR architectures, datasets, and adaptation methods. |
|
|
|
|
|
This model is not a fine-tuned derivative, but a **curated research anchor** used to support systematic studies in: |
|
|
|
|
|
- domain adaptation, |
|
|
- robustness to spontaneous and long-form speech, |
|
|
- accented and low-resource linguistic profiles, |
|
|
- telephony and bandwidth-constrained speech, |
|
|
- multi-speaker and meeting transcription. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1. Research Motivation |
|
|
|
|
|
The Gilbert project aims to build highly specialized ASR systems optimized for: |
|
|
|
|
|
- professional meeting transcription (hybrid/remote), |
|
|
- long-form multi-speaker discourse, |
|
|
- institutional environments (education, public sector), |
|
|
- constrained audio conditions (telephony, VoIP, low SNR), |
|
|
- sociolinguistic diversity (African, Canadian, Belgian and other French accents). |
|
|
|
|
|
While Whisper Large V3 provides strong baseline performance, its behavior under domain shifts (spontaneous interactions, overlapping speech, degraded microphones) requires systematic study. |
|
|
`Gilbert-FR-Source` provides the **frozen starting point** for this line of research, ensuring controlled comparisons between experiments. |
|
|
|
|
|
--- |
|
|
|
|
|
## 2. Scientific Goals and Research Questions |
|
|
|
|
|
This model is used to answer a series of research questions: |
|
|
|
|
|
### **Q1. Long-form modeling** |
|
|
How does Whisper Large V3 behave on meetings lasting 30–120 minutes, with natural topic shifts, interruptions, and pragmatic markers?
|
|
|
|
|
### **Q2. Accent robustness** |
|
|
Which classes of French accents induce the strongest WER degradation? |
|
|
How does robustness vary across FLEURS, African French, and Common Voice subsets? |
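For instance, a minimal sketch of loading the relevant French evaluation splits with Hugging Face Datasets (the config names are the public ones on the Hub; Common Voice is gated, so a Hub token with accepted terms is assumed):

```python
from datasets import load_dataset

# French FLEURS test split (Hub config name "fr_fr").
fleurs_fr = load_dataset("google/fleurs", "fr_fr", split="test", streaming=True)

# French Common Voice 17.0 test split; the dataset is gated, so a Hub
# token with accepted terms is assumed to be configured.
cv_fr = load_dataset(
    "mozilla-foundation/common_voice_17_0", "fr", split="test", streaming=True
)

# Streaming avoids a full download when probing accent subsets.
sample = next(iter(fleurs_fr))
print(sample["transcription"])  # reference text for WER scoring
```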
|
|
|
|
|
### **Q3. Telephony adaptation** |
|
|
What is the degradation curve when 16 kHz audio is downsampled to 8 kHz or passed through μ-law (G.711-style) compression?
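One way to sketch these conditions (torchaudio is our assumption here, not the project's stated tooling) is to band-limit the signal to 8 kHz and apply 8-bit μ-law companding before resampling back to Whisper's native 16 kHz input:

```python
import torchaudio
import torchaudio.functional as F

# Load a 16 kHz reference clip (the path is illustrative).
wav, sr = torchaudio.load("sample_16k.wav")

# Downsample to the telephony band (8 kHz), then apply 8-bit mu-law
# encoding/decoding to mimic G.711-style companding loss.
wav_8k = F.resample(wav, orig_freq=sr, new_freq=8000)
mulaw = F.mu_law_encoding(wav_8k, quantization_channels=256)
wav_degraded = F.mu_law_decoding(mulaw, quantization_channels=256)

# Whisper consumes 16 kHz input, so upsample back before transcription;
# the band-limiting and quantization artifacts are preserved.
wav_for_asr = F.resample(wav_degraded, orig_freq=8000, new_freq=16000)
```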
|
|
|
|
|
### **Q4. Domain adaptation efficiency** |
|
|
What is the marginal gain of targeted fine-tuning on professional meeting datasets (education, administration, healthcare)? |
|
|
|
|
|
### **Q5. Multilingual side-effects** |
|
|
To what extent does French fine-tuning affect cross-lingual generalization? |
|
|
|
|
|
These research axes structure the development of future specialized Gilbert models. |
|
|
|
|
|
--- |
|
|
|
|
|
## 3. Benchmark Reference Results |
|
|
|
|
|
The following WER scores originate from established open benchmarks and serve as a *reference baseline* for future experiments: |
|
|
|
|
|
| Dataset | WER |
|---------|-----|
| MLS (FR) | 3.98 % |
| Common Voice FR (v13.0) | 7.28 % |
| VoxPopuli (FR) | 8.91 % |
| FLEURS (FR) | 4.84 % |
| African Accented French | 4.20 % |
|
|
|
|
These scores serve as **upper bounds on acceptable WER**: targeted fine-tuning is expected to match or lower them.
|
|
Future Gilbert variants will be evaluated using: |
|
|
|
|
|
- internal meeting datasets, |
|
|
- domain-specific corpora (administration, higher education, healthcare), |
|
|
- accented speech corpora, |
|
|
- telephony datasets, |
|
|
- long-form evaluation methods (> 1 hour audio). |
|
|
|
|
|
--- |
|
|
|
|
|
## 4. Architecture |
|
|
|
|
|
The model is based on the **Whisper Large V3** encoder–decoder architecture, offering: |
|
|
|
|
|
- large multilingual pretraining, |
|
|
- long-context modeling capacity, |
|
|
- robust cross-lingual alignment, |
|
|
- stable decoding for long outputs, |
|
|
- strong zero-shot performance on French. |
|
|
|
|
|
It is compatible with: |
|
|
|
|
|
- Hugging Face Transformers, |
|
|
- CTranslate2, |
|
|
- ONNX Runtime, |
|
|
- MLX (Apple Silicon), |
|
|
- quantization-based acceleration pipelines. |
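As a minimal loading sketch with Transformers (the model id below is the stated base checkpoint; substitute this repository's id once the Gilbert weights are published):

```python
import torch
from transformers import pipeline

# Stated base checkpoint; swap in the Gilbert-FR-Source repo id as needed.
MODEL_ID = "openai/whisper-large-v3"

asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    torch_dtype=torch.float16,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

# Chunked decoding handles recordings longer than Whisper's 30 s window.
result = asr(
    "meeting.wav",  # illustrative path
    chunk_length_s=30,
    generate_kwargs={"language": "french", "task": "transcribe"},
)
print(result["text"])
```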
|
|
|
|
|
--- |
|
|
|
|
|
## 5. Methodology and Reproducibility |
|
|
|
|
|
`Gilbert-FR-Source` is used in strict research settings emphasizing: |
|
|
|
|
|
### **Reproducible training protocols** |
|
|
- frozen weights for baseline comparison, |
|
|
- controlled hyperparameter schedules, |
|
|
- consistent evaluation datasets, |
|
|
- deterministic decoding configurations. |
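A sketch of the determinism pinning this implies (the seed value and helper name are illustrative, not part of the released tooling):

```python
import random

import numpy as np
import torch

def set_determinism(seed: int = 42) -> None:
    """Pin every RNG source so baseline runs are exactly repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Greedy decoding (no sampling, beam size 1) removes generation-time
# stochasticity on top of the fixed seeds.
DETERMINISTIC_GENERATE_KWARGS = {"do_sample": False, "num_beams": 1}
```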
|
|
|
|
|
### **Evaluation methodology** |
|
|
WER is computed with standard normalization (lowercasing, punctuation removal). |
|
|
More advanced metrics (diarization error rate, long-context drift) are included in internal research pipelines. |
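A sketch of this scoring step, assuming the Hugging Face `evaluate` WER metric and Whisper's multilingual `BasicTextNormalizer` (which lowercases and strips punctuation-like symbols):

```python
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = load("wer")
normalizer = BasicTextNormalizer()

references = ["Bonjour à tous, la réunion peut commencer."]
predictions = ["bonjour à tous la réunion peut commencer"]

# Normalize both sides before scoring so casing and punctuation
# differences do not inflate the error rate.
wer = wer_metric.compute(
    references=[normalizer(r) for r in references],
    predictions=[normalizer(p) for p in predictions],
)
print(f"WER: {wer:.2%}")  # 0.00% once normalization removes the mismatches
```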
|
|
|
|
|
### **Versioning policy** |
|
|
This repository represents version `0.1` of the research baseline. |
|
|
All future fine-tuned models will explicitly reference this version for traceability. |
|
|
|
|
|
--- |
|
|
|
|
|
## 6. Limitations |
|
|
|
|
|
This baseline inherits the known limitations of Whisper and of the underlying datasets: |
|
|
|
|
|
- sensitivity to overlapping speech, |
|
|
- occasional hallucinations in long-form decoding, |
|
|
- domain shift on spontaneous dialogue, |
|
|
- potential biases related to accent distribution in training data, |
|
|
- suboptimal performance on telephony-bandwidth (8 kHz) audio.
|
|
|
|
|
Understanding and quantifying these limitations is one of the core objectives of the Gilbert research roadmap. |
|
|
|
|
|
--- |
|
|
|
|
|
## 7. Future Work (Planned Research Directions) |
|
|
|
|
|
The following models will be developed as independent checkpoints: |
|
|
|
|
|
- **Gilbert-FR-Longform-v1** |
|
|
Long meetings, multi-speaker interaction, discourse-level context stability. |
|
|
|
|
|
- **Gilbert-FR-Accents-v1** |
|
|
Robustness to regional and international French accents. |
|
|
|
|
|
- **Gilbert-FR-Telephone-v1** |
|
|
Optimized for 8 kHz VoIP/call-center speech. |
|
|
|
|
|
- **Gilbert-Multilingual-v1** |
|
|
Extended cross-lingual performance with optimized French anchors. |
|
|
|
|
|
Each model will include detailed evaluation reports and will adhere to research reproducibility standards. |
|
|
|
|
|
--- |
|
|
|
|
|
## 8. License |
|
|
|
|
|
This repository is distributed under the MIT License, and a copy of the license is included. Some files were originally released under MIT by their upstream authors.
|
|
|
|
|
All future Gilbert models built on top of this baseline are the exclusive property of Lexia France. |
|
|
|
|
|
--- |
|
|
|
|
|
## 9. Contact |
|
|
|
|
|
For research collaboration, evaluation access, or technical inquiries: |
|
|
|
|
|
- Website: https://gilbert-assistant.fr |
|
|
- Email: [email protected] |
|
|
|