FrWhisper

Model Description

FrWhisper is a fine-tuned version of OpenAI's Whisper Large V3 model, specifically optimized for French speech recognition with enhanced capabilities for transcribing interjections, hesitations, word repetitions, and interrupted words in conversational French.

This model was trained on a combination of two major French speech corpora:

LangAge Corpus: Demographically-structured conversational French data
ESLO (Enquêtes Sociolinguistiques à Orléans): Additional French conversational data for specific age groups (26-46 years, 65+ years)

The model is particularly well-suited for applications requiring detailed transcription of spontaneous speech, including discourse markers, hesitations, and other paralinguistic features that are typically ignored by standard ASR systems.

Key Features

Enhanced Interjection Recognition: Accurately transcribes French interjections like "euh", "ah", "hé", "hein", etc.
Hesitation Patterns: Captures natural speech hesitations and fillers
Conversational Speech: Optimized for informal, spontaneous French speech
Demographic Coverage: Trained on diverse speaker demographics and age groups
Robust Performance: 14.18 percentage points lower WER than base Whisper Large V3 on the combined LangAge and ESLO data

Performance

Word Error Rate (WER) Comparison with Whisper Large V3

WER is computed on LangAge and ESLO recordings that contain interjections and hesitations. These data are challenging for ASR models because they include many recordings of older speakers with varied voice quality, and because features of spontaneous speech are difficult to transcribe.

Lower WER indicates better performance.

Dataset Category	Sample Size	Whisper Large V3	FrWhisper	Improvement
All Data	196,923	99.08%	84.90%	14.18pp
LangAge Only	69,176	78.43%	67.21%	11.22pp
ESLO Only	127,747	110.25%	94.47%	15.78pp
Training Data	195,183	98.98%	84.64%	14.34pp
Test Data	195,183	98.98%	84.64%	14.34pp

pp = percentage points (absolute reduction in WER)
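
The figures above can be read with the usual definition of WER: word-level edit distance divided by the number of reference words. The sketch below is a minimal illustration, not the evaluation script behind the table. The card does not say which tool or text normalization was used, so the lowercasing and whitespace tokenization here are assumptions, and the function names are hypothetical. It also shows why WER can exceed 100%, as in the ESLO row: insertions count as errors but do not lengthen the reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase + whitespace split; the card's normalization is unknown.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def improvement_pp(baseline_wer_pct: float, model_wer_pct: float) -> float:
    """Absolute improvement in percentage points."""
    return round(baseline_wer_pct - model_wer_pct, 2)


print(word_error_rate("euh oui", "oui"))      # one missed interjection out of two words
print(word_error_rate("oui", "ah oui oui bon"))  # three insertions against one reference word
print(improvement_pp(99.08, 84.90))           # "All Data" row of the table
```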

Key Performance Highlights

14.18 percentage points overall WER improvement compared to Whisper Large V3
Median WER for LangAge data: Improved from 100% to 44.44%
Consistent WER across training and test data, which suggests little overfitting

Training Details

Model Architecture

Base Model: OpenAI Whisper Large V3
Model Size: 1550M parameters
Audio Processing: 16kHz sampling rate, Log-Mel Spectrograms (128 mel bins)
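
As a rough illustration of what these audio settings imply for the model input, the arithmetic below derives the shape of one log-Mel feature block. The 30-second window and the hop length of 160 samples are the standard Whisper front-end values and are not stated in this card, so they are marked as assumptions. The function name is hypothetical.

```python
SAMPLE_RATE = 16_000   # Hz, from the card
N_MELS = 128           # mel bins, from the card
CHUNK_SECONDS = 30     # assumption: standard Whisper window length
HOP_LENGTH = 160       # assumption: standard Whisper hop (10 ms at 16 kHz)


def input_feature_shape(sample_rate: int = SAMPLE_RATE,
                        n_mels: int = N_MELS,
                        chunk_seconds: int = CHUNK_SECONDS,
                        hop_length: int = HOP_LENGTH) -> tuple:
    """Return (mel bins, frames) for one padded or truncated audio window."""
    samples = sample_rate * chunk_seconds   # 480,000 samples per window
    frames = samples // hop_length          # one spectrogram frame per hop
    return (n_mels, frames)


print(input_feature_shape())  # (128, 3000)
```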

Training Data

Combined Dataset: LangAge + ESLO corpora
Audio Preprocessing: Resampled from 44.1kHz to 16kHz
Data Cleaning:
- Removal of segments < 100ms
- Filtering of silent audio segments
- Exclusion of noise patterns
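
The cleaning rules above could be sketched as a simple per-segment filter. This is an illustration under stated assumptions, not the authors' pipeline. Only the 100 ms limit comes from the card. The RMS silence threshold, the `Segment` type, and the idea of matching noise by transcript label are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence

MIN_DURATION_S = 0.100  # from the card: segments shorter than 100 ms are removed
SILENCE_RMS = 1e-3      # hypothetical threshold; the card gives no value


@dataclass
class Segment:
    samples: Sequence[float]  # mono audio scaled to [-1.0, 1.0]
    sample_rate: int
    text: str

    @property
    def duration_s(self) -> float:
        return len(self.samples) / self.sample_rate


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude; 0.0 for an empty segment."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def keep_segment(seg: Segment, noise_labels: frozenset = frozenset()) -> bool:
    """Apply the three cleaning rules in order: duration, silence, noise."""
    if seg.duration_s < MIN_DURATION_S:
        return False
    if rms(seg.samples) < SILENCE_RMS:
        return False
    # Assumption: noise is identified by a transcript label such as "[bruit]".
    if seg.text.strip().lower() in noise_labels:
        return False
    return True
```

A call such as `keep_segment(seg, frozenset({"[bruit]"}))` would then drop segments whose transcript consists only of that (hypothetical) noise label.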

Special Features Captured

The model is trained to recognize and transcribe:

Interjections: ah, bah, beh, ben, chh, eh, euh, ha, hé, hein, hop, hum, m-hm, mmh, mm, oh, ouf, pff, youh
Word Repetitions and Hesitations: Natural speech disfluencies
Interrupted Words: Partial word utterances
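
For downstream analysis, the interjection inventory above can be used to count fillers in a transcript, or to strip them when comparing against references that omit them. The sketch below is a minimal example with hypothetical helper names. It matches whole tokens only, so it cannot tell an interjection from a homograph in context.

```python
from collections import Counter

# Inventory copied from the list in this card.
INTERJECTIONS = frozenset(
    "ah bah beh ben chh eh euh ha hé hein hop hum m-hm mmh mm oh ouf pff youh".split()
)


def _normalize(token: str) -> str:
    """Lowercase and trim surrounding punctuation (keeps the hyphen in 'm-hm')."""
    return token.lower().strip(".,;:!?\"()")


def count_interjections(text: str) -> Counter:
    """Count occurrences of each listed interjection in a transcript."""
    return Counter(t for t in map(_normalize, text.split()) if t in INTERJECTIONS)


def strip_interjections(text: str) -> str:
    """Remove listed interjections, leaving all other tokens untouched."""
    return " ".join(tok for tok in text.split() if _normalize(tok) not in INTERJECTIONS)


print(count_interjections("euh alors ben euh oui"))
print(strip_interjections("euh alors je vais commencer"))
```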

Usage

Using Transformers

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
processor = WhisperProcessor.from_pretrained("aihpi/FrWhisper")
model = WhisperForConditionalGeneration.from_pretrained("aihpi/FrWhisper")
model.eval()  # inference mode: disables dropout

def transcribe_french(audio_path):
    """Transcribe one French audio file and return the text."""
    # Load audio as mono and resample to the 16kHz rate the model expects
    audio, sr = librosa.load(audio_path, sr=16000)

    # Convert the waveform to log-Mel input features.
    # With default processor settings, audio is padded or truncated to Whisper's 30-second window.
    input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

    # Generate token ids, forcing French transcription (no translation)
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="fr", task="transcribe")

    # Decode token ids to text, dropping special tokens
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

# Example usage
transcription = transcribe_french("french_audio.wav")
print(transcription)

Expected Output

Unlike standard ASR models, FrWhisper will include natural speech elements:

Standard Whisper: "Alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"

FrWhisper: "euh alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"
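
To see exactly where two transcripts of the same audio differ, a token-level diff is often enough. The sketch below uses Python's standard `difflib` on the two example outputs above. The helper name `token_diff` is hypothetical, and lowercasing before comparison is an assumption made so that capitalization does not register as a change.

```python
import difflib


def token_diff(baseline: str, detailed: str) -> list:
    """Return (operation, baseline tokens, detailed tokens) for each differing span."""
    a = baseline.lower().split()
    b = detailed.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


standard = "Alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"
frwhisper = "euh alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"

print(token_diff(standard, frwhisper))  # [('insert', [], ['euh'])]
```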

Intended Use Cases

Linguistic Research: Research requiring authentic speech representation
Qualitative Research: Detailed transcription preserving speech patterns
Interview Transcription: Academic and professional interview documentation

Limitations and Considerations

Specialized Domain: Optimized for conversational French; may not perform as well on formal speech
Interjection Focus: May score a higher WER than standard models when the reference transcripts omit interjections, since the transcribed fillers then count as insertions
Audio Quality: Best performance on clear audio recordings (similar to training data quality)

Citation

If you use this model in your research, please cite:

@misc{frwhisper2025,
  title={FrWhisper: Fine-tuned Whisper for French Conversational Speech with Interjections},
  author={Hanno Müller and Annette Gerstenberg},
  year={2025}
}

Acknowledgments

The authors acknowledge the financial support by the German Federal Ministry of Research, Technology and Space (BMFTR) through the project «KI-Servicezentrum Berlin Brandenburg» (16IS22092).

Model Card Contact

For questions about this model, please open an issue in the model repository or contact [[email protected]].


