FrWhisper
Model Description
FrWhisper is a fine-tuned version of OpenAI's Whisper Large V3 model, specifically optimized for French speech recognition with enhanced capabilities for transcribing interjections, hesitations, word repetitions, and interrupted words in conversational French.
This model was trained on a combination of two major French speech corpora:
- LangAge Corpus: Demographically-structured conversational French data
- ESLO (Enquêtes Sociolinguistiques à Orléans): Additional French conversational data for specific age groups (26-46 years, 65+ years)
The model is particularly well-suited for applications requiring detailed transcription of spontaneous speech, including discourse markers, hesitations, and other paralinguistic features that are typically ignored by standard ASR systems.
Key Features
- Enhanced Interjection Recognition: Accurately transcribes French interjections like "euh", "ah", "hé", "hein", etc.
- Hesitation Patterns: Captures natural speech hesitations and fillers
- Conversational Speech: Optimized for informal, spontaneous French speech
- Demographic Coverage: Trained on diverse speaker demographics and age groups
- Lower WER than the Base Model: 84.90% vs. 99.08% for Whisper Large V3 on the combined LangAge and ESLO data
Performance
Word Error Rate (WER) Comparison with Whisper Large V3
WER is computed on LangAge and ESLO, whose reference transcripts include interjections and hesitations. These data are challenging for ASR models because they contain many recordings of older speakers with varying voice quality, and because features of spontaneous speech are difficult to transcribe.
Lower WER indicates better performance. WER can exceed 100% when a hypothesis contains many inserted words relative to the length of the reference.
| Dataset Category | Sample Size | Whisper Large V3 | FrWhisper | Improvement |
|---|---|---|---|---|
| All Data | 196,923 | 99.08% | 84.90% | 14.18pp |
| LangAge Only | 69,176 | 78.43% | 67.21% | 11.22pp |
| ESLO Only | 127,747 | 110.25% | 94.47% | 15.78pp |
| Training Data | 195,183 | 98.98% | 84.64% | 14.34pp |
| Test Data | 195,183 | 98.98% | 84.64% | 14.34pp |
pp = percentage points (absolute reduction in WER relative to Whisper Large V3)
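The WER values in the table can be read against the usual definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch of that definition, assuming whitespace tokenization and no text normalization; the card does not state which normalization or scoring tool the authors used, so `word_error_rate` is an illustrative helper and not their evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A dropped interjection counts as one deletion.
missed = word_error_rate("euh alors je vais commencer", "alors je vais commencer")

# Two inserted words against a one-word reference give a WER above 100%.
inflated = word_error_rate("oui", "euh oui oui")
```

Under this definition, a model that omits every "euh" is penalized once per omission, which is one reason standard ASR output scores poorly on references that keep disfluencies.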
Key Performance Highlights
- 14.18 percentage points overall WER improvement compared to Whisper Large V3
- Median WER for LangAge data: Improved from 100% to 44.44%
- Consistent performance across both training and test data, which suggests the model is not overfitting
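The improvement figures are absolute differences in percentage points, which is not the same as a relative reduction. The sketch below recomputes both from the table values; `wer_improvement` is a hypothetical helper name, and the relative figure is derived here for illustration and is not reported in the card.

```python
def wer_improvement(baseline_wer: float, finetuned_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in pp, relative reduction in %)."""
    absolute_pp = baseline_wer - finetuned_wer
    relative_pct = 100.0 * absolute_pp / baseline_wer
    return round(absolute_pp, 2), round(relative_pct, 2)


# WER values (in %) copied from the table above.
rows = {
    "All Data": (99.08, 84.90),
    "LangAge Only": (78.43, 67.21),
    "ESLO Only": (110.25, 94.47),
}
for name, (baseline, finetuned) in rows.items():
    absolute_pp, relative_pct = wer_improvement(baseline, finetuned)
    print(f"{name}: {absolute_pp} pp absolute, {relative_pct}% relative")
```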
Training Details
Model Architecture
- Base Model: OpenAI Whisper Large V3
- Model Size: 1550M parameters
- Audio Processing: 16kHz sampling rate, Log-Mel Spectrograms (128 mel bins)
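To make the audio front-end concrete, the sketch below derives the shape of one input feature matrix from the sampling rate and mel-bin count listed above. The 30-second window and the 160-sample (10 ms) hop are the standard Whisper front-end values and are assumptions here, since the card itself only states the sampling rate and the number of mel bins.

```python
SAMPLE_RATE = 16_000  # Hz, from the card
N_MELS = 128          # mel bins, from the card
CHUNK_SECONDS = 30    # assumption: Whisper's fixed input window
HOP_LENGTH = 160      # assumption: Whisper's 10 ms hop at 16 kHz


def log_mel_shape(
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
    hop_length: int = HOP_LENGTH,
    n_mels: int = N_MELS,
) -> tuple[int, int]:
    """Shape (mel bins, frames) of the features for one padded audio window."""
    n_samples = sample_rate * chunk_seconds
    n_frames = n_samples // hop_length
    return n_mels, n_frames


print(log_mel_shape())
```

Shorter recordings are padded to the full window, so the feature shape is the same regardless of the clip length.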
Training Data
- Combined Dataset: LangAge + ESLO corpora
- Audio Preprocessing: Resampled from 44.1kHz to 16kHz
- Data Cleaning:
  - Removal of segments < 100ms
  - Filtering of silent audio segments
  - Exclusion of noise patterns
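The first two cleaning steps can be sketched as a simple per-segment filter. This is only an illustration under stated assumptions: the 100 ms minimum comes from the list above, but the RMS silence threshold (`silence_rms`) is a hypothetical value, and the card does not say how silence or noise patterns were actually detected, so noise exclusion is not covered.

```python
import math
from typing import Sequence


def keep_segment(
    samples: Sequence[float],
    sample_rate: int = 16_000,
    min_duration_s: float = 0.1,  # 100 ms, from the card
    silence_rms: float = 1e-3,    # hypothetical threshold, not from the card
) -> bool:
    """Return True if a segment is long enough and not (near-)silent."""
    duration_s = len(samples) / sample_rate
    if duration_s < min_duration_s:
        return False
    # Root-mean-square amplitude as a crude energy measure.
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return rms > silence_rms


too_short = keep_segment([0.5] * 1000)  # 62.5 ms at 16 kHz
silent = keep_segment([0.0] * 3200)     # 200 ms of zeros
voiced = keep_segment([0.1] * 3200)     # 200 ms with energy
```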
Special Features Captured
The model is trained to recognize and transcribe:
- Interjections: ah, bah, beh, ben, chh, eh, euh, ha, hé, hein, hop, hum, m-hm, mmh, mm, oh, ouf, pff, youh
- Word Repetitions and Hesitations: Natural speech disfluencies
- Interrupted Words: Partial word utterances
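For downstream analysis, the interjection inventory above can be used to count discourse markers in a transcript. The sketch below assumes lowercase matching on whitespace-separated tokens with surrounding punctuation stripped; `count_interjections` is a hypothetical helper, not part of the model or its repository.

```python
from collections import Counter

# Interjection inventory copied from the list above.
INTERJECTIONS = frozenset({
    "ah", "bah", "beh", "ben", "chh", "eh", "euh", "ha", "hé", "hein",
    "hop", "hum", "m-hm", "mmh", "mm", "oh", "ouf", "pff", "youh",
})


def count_interjections(transcript: str) -> Counter:
    """Count occurrences of known interjections in a transcript."""
    # Strip only leading and trailing punctuation so "m-hm" stays intact.
    tokens = (tok.strip(".,;:!?\"'()«»") for tok in transcript.lower().split())
    return Counter(tok for tok in tokens if tok in INTERJECTIONS)


counts = count_interjections("euh alors je vais euh commencer, hein ?")
print(counts.most_common())
```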
Usage
Using Transformers
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
processor = WhisperProcessor.from_pretrained("aihpi/FrWhisper")
model = WhisperForConditionalGeneration.from_pretrained("aihpi/FrWhisper")

def transcribe_french(audio_path):
    # Load the audio and resample it to the 16 kHz rate the model expects
    audio, _ = librosa.load(audio_path, sr=16000)
    # Convert the waveform to log-Mel input features
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    # Generate token IDs without tracking gradients
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="fr", task="transcribe")
    # Decode token IDs to text
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

# Example usage
transcription = transcribe_french("french_audio.wav")
print(transcription)
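The example above processes a single window of audio, and interviews are usually much longer. One option is the Transformers `pipeline` API with chunking, sketched below. This assumes the checkpoint loads through the standard automatic-speech-recognition pipeline like other Whisper models, which the card does not explicitly confirm; `transcribe_long_audio` is a hypothetical wrapper name, and reading from a file path requires ffmpeg to be installed.

```python
def transcribe_long_audio(
    audio_path: str,
    model_id: str = "aihpi/FrWhisper",
    chunk_length_s: int = 30,
) -> str:
    """Transcribe a long recording by splitting it into overlapping chunks."""
    # Imported lazily so the function can be defined without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,  # pipeline splits and stitches long audio
    )
    result = asr(
        audio_path,
        generate_kwargs={"language": "fr", "task": "transcribe"},
    )
    return result["text"]


# Example (downloads the model on first use):
# print(transcribe_long_audio("french_interview.wav"))
```

Chunked decoding can lose or duplicate words at chunk boundaries, so for research-grade transcripts it may be worth spot-checking those regions.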
Expected Output
Unlike standard ASR models, FrWhisper will include natural speech elements:
Standard Whisper: "Alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"
FrWhisper: "euh alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"
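To see which tokens a disfluency-preserving transcript adds over a standard one, the two outputs can be aligned at word level. The sketch below uses Python's `difflib` on lowercased, whitespace-split words and reports only pure insertions; `inserted_tokens` is a hypothetical helper, and the two strings are the example outputs shown above.

```python
import difflib


def inserted_tokens(standard: str, detailed: str) -> list[str]:
    """Words present in `detailed` that were inserted relative to `standard`."""
    a = standard.lower().split()
    b = detailed.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":  # ignore substitutions and deletions
            extra.extend(b[j1:j2])
    return extra


standard = "Alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"
detailed = "euh alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"
print(inserted_tokens(standard, detailed))
```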
Intended Use Cases
- Linguistic Research: Research requiring authentic speech representation
- Qualitative Research: Detailed transcription preserving speech patterns
- Interview Transcription: Academic and professional interview documentation
Limitations and Considerations
- Specialized Domain: Optimized for conversational French; may not perform as well on formal speech
- Interjection Focus: May yield a higher WER than standard models when scored against reference transcripts that omit interjections, or when interjections are not desired in the output
- Audio Quality: Best performance on clear audio recordings (similar to training data quality)
Citation
If you use this model in your research, please cite:
@misc{frwhisper2025,
  title={FrWhisper: Fine-tuned Whisper for French Conversational Speech with Interjections},
  author={Hanno Müller and Annette Gerstenberg},
  year={2025}
}
Acknowledgments
The authors acknowledge the financial support by the German Federal Ministry of Research, Technology and Space (BMFTR) through the project «KI-Servicezentrum Berlin Brandenburg» (16IS22092).
Model Card Contact
For questions about this model, please open an issue in the model repository or contact [[email protected]].