FrWhisper

Model Description

FrWhisper is a fine-tuned version of OpenAI's Whisper Large V3 model, specifically optimized for French speech recognition with enhanced capabilities for transcribing interjections, hesitations, word repetitions, and interrupted words in conversational French.

This model was trained on a combination of two major French speech corpora:

  • LangAge Corpus: Demographically-structured conversational French data
  • ESLO (Enquêtes Sociolinguistiques à Orléans): Additional French conversational data for specific age groups (26-46 years, 65+ years)

The model is particularly well-suited for applications requiring detailed transcription of spontaneous speech, including discourse markers, hesitations, and other paralinguistic features that are typically ignored by standard ASR systems.

Key Features

  • Enhanced Interjection Recognition: Accurately transcribes French interjections like "euh", "ah", "hé", "hein", etc.
  • Hesitation Patterns: Captures natural speech hesitations and fillers
  • Conversational Speech: Optimized for informal, spontaneous French speech
  • Demographic Coverage: Trained on diverse speaker demographics and age groups
  • Robust Performance: 14.18 percentage points lower WER than base Whisper Large V3 on the combined LangAge + ESLO data

Performance

Word Error Rate (WER) Comparison with Whisper Large V3

WER is computed on the LangAge and ESLO data, whose transcripts include interjections and hesitations. These data are challenging for ASR models because they contain many recordings of older speakers with varying voice quality, and because features of spontaneous speech are difficult to transcribe.

Lower WER indicates better performance.
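
Some values in the table below exceed 100%. This is possible because WER divides the number of word-level substitutions, deletions, and insertions by the length of the reference, so a hypothesis with many inserted words can produce more errors than there are reference words. The sketch below is a minimal, dependency-free illustration of that definition. It is not the evaluation script used for this model card, and the function name and the second example sentence are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# A missing interjection counts as one deletion out of 15 reference words.
reference = ("euh alors je vais commencer par des petites questions "
             "tout à fait simples n'est-ce pas")
hypothesis = ("alors je vais commencer par des petites questions "
              "tout à fait simples n'est-ce pas")
print(word_error_rate(reference, hypothesis))

# Invented example: 1 substitution + 3 insertions against 2 reference words.
print(word_error_rate("euh oui", "et oui bien sûr merci"))  # 2.0, i.e. 200%
```

Real evaluations usually normalize case and punctuation before scoring; this sketch compares tokens exactly as written.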

Dataset Category    Sample Size    Whisper Large V3    FrWhisper    Improvement
All Data            196,923        99.08%              84.90%       14.18 pp
LangAge Only        69,176         78.43%              67.21%       11.22 pp
ESLO Only           127,747        110.25%             94.47%       15.78 pp
Training Data       195,183        98.98%              84.64%       14.34 pp
Test Data           195,183        98.98%              84.64%       14.34 pp

pp = percentage points improvement
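
A percentage-point improvement is an absolute difference between two WER values, which is not the same as a relative reduction. The short sketch below computes both for the "All Data" row; the function name is illustrative and the relative figure is derived here rather than reported by the authors.

```python
def wer_improvement(baseline_wer: float, finetuned_wer: float) -> tuple[float, float]:
    """Return (absolute improvement in pp, relative reduction in %)."""
    absolute_pp = baseline_wer - finetuned_wer
    relative_pct = 100.0 * absolute_pp / baseline_wer
    return round(absolute_pp, 2), round(relative_pct, 2)


# "All Data" row: Whisper Large V3 at 99.08% WER, FrWhisper at 84.90% WER.
print(wer_improvement(99.08, 84.90))  # (14.18, 14.31)
```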

Key Performance Highlights

  • Overall WER reduced by 14.18 percentage points compared with Whisper Large V3 (from 99.08% to 84.90%)
  • Median WER for LangAge data: Improved from 100% to 44.44%
  • Consistent WER across training and test data (84.64% in both rows), suggesting no overfitting

Training Details

Model Architecture

  • Base Model: OpenAI Whisper Large V3
  • Model Size: 1550M parameters
  • Audio Processing: 16 kHz sampling rate, log-Mel spectrograms (128 mel bins)
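
As a rough sketch of what these audio settings imply for the model input, the arithmetic below derives the spectrogram shape. The 30-second window and the 160-sample hop are the standard Whisper front-end values and are assumptions here, since this card states only the sampling rate and the number of mel bins.

```python
SAMPLE_RATE_HZ = 16_000  # from the model card
N_MELS = 128             # from the model card
WINDOW_SECONDS = 30      # assumption: standard Whisper input window
HOP_LENGTH = 160         # assumption: standard Whisper hop (10 ms at 16 kHz)


def input_feature_shape(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    n_mels: int = N_MELS,
    window_seconds: int = WINDOW_SECONDS,
    hop_length: int = HOP_LENGTH,
) -> tuple[int, int]:
    """Return (mel bins, frames) for one padded input window."""
    samples_per_window = sample_rate_hz * window_seconds
    frames = samples_per_window // hop_length
    return n_mels, frames


print(input_feature_shape())  # (128, 3000)
```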

Training Data

  • Combined Dataset: LangAge + ESLO corpora
  • Audio Preprocessing: Resampled from 44.1 kHz to 16 kHz
  • Data Cleaning:
    • Removal of segments shorter than 100 ms
    • Filtering of silent audio segments
    • Exclusion of noise patterns
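
The duration and silence filters above could be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the RMS-based silence test and its threshold are hypothetical, and the noise-pattern exclusion is left out because the card does not describe it.

```python
import math
from typing import Sequence

MIN_DURATION_S = 0.1  # from the model card: drop segments shorter than 100 ms
SILENCE_RMS = 1e-3    # hypothetical threshold; the card does not give one


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a waveform (0.0 for an empty one)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def keep_segment(samples: Sequence[float], sample_rate: int = 16_000) -> bool:
    """Return True if the segment passes the duration and silence filters."""
    duration_s = len(samples) / sample_rate
    if duration_s < MIN_DURATION_S:
        return False  # too short
    if rms(samples) < SILENCE_RMS:
        return False  # treated as silent under the assumed threshold
    return True


print(keep_segment([0.5] * 800))         # 50 ms segment: False
print(keep_segment([0.0] * 16_000))      # 1 s of silence: False
print(keep_segment([0.1, -0.1] * 8_000)) # 1 s of signal: True
```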

Special Features Captured

The model is trained to recognize and transcribe:

  • Interjections: ah, bah, beh, ben, chh, eh, euh, ha, hé, hein, hop, hum, m-hm, mmh, mm, oh, ouf, pff, youh
  • Word Repetitions and Hesitations: Natural speech disfluencies
  • Interrupted Words: Partial word utterances
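
Because these disfluencies are kept in the output, they can be analyzed downstream. As one hedged example, the sketch below flags immediate word repetitions in a transcript. The function name and the sample sentence are invented for illustration, and the check is naive: it also flags legitimate doubled words such as "nous nous".

```python
def find_repetitions(transcript: str) -> list[tuple[int, str]]:
    """Return (token index, word) for each word that repeats the previous one."""
    tokens = transcript.lower().split()
    return [
        (index, word)
        for index, (previous, word) in enumerate(zip(tokens, tokens[1:]), start=1)
        if previous == word
    ]


# Invented example sentence with two repeated words.
print(find_repetitions("je je vais euh commencer par par par là"))
# [(1, 'je'), (6, 'par'), (7, 'par')]
```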

Usage

Using Transformers

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
processor = WhisperProcessor.from_pretrained("aihpi/FrWhisper")
model = WhisperForConditionalGeneration.from_pretrained("aihpi/FrWhisper")

def transcribe_french(audio_path):
    """Transcribe one French audio file and return the text."""
    # Load the audio as mono and resample it to the 16 kHz rate the model expects
    audio, _ = librosa.load(audio_path, sr=16000)

    # Convert the waveform to log-Mel input features
    # (by default the processor pads or truncates to Whisper's 30-second window)
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

    # Generate token IDs, forcing French transcription
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="fr", task="transcribe")

    # Decode token IDs to text
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# Example usage
transcription = transcribe_french("french_audio.wav")
print(transcription)

Expected Output

Unlike standard ASR models, FrWhisper will include natural speech elements:

Standard Whisper: "Alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"

FrWhisper: "euh alors je vais commencer par des petites questions tout à fait simples n'est-ce pas"

Intended Use Cases

  • Linguistic Research: Research requiring authentic speech representation
  • Qualitative Research: Detailed transcription preserving speech patterns
  • Interview Transcription: Academic and professional interview documentation

Limitations and Considerations

  • Specialized Domain: Optimized for conversational French; may not perform as well on formal speech
  • Interjection Focus: When interjections are not wanted in the transcript, the transcribed interjections count as errors, so WER can be higher than with standard models
  • Audio Quality: Best performance on clear audio recordings (similar to training data quality)
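
When interjections are not wanted, one possible workaround is to filter them out of the model output after transcription. The sketch below is a naive token filter built from the interjection list in this card. It is not part of the model or its repository, the function name is illustrative, and it will also remove any ordinary word that happens to share a spelling with a listed interjection.

```python
# Interjection inventory copied from the "Special Features Captured" list.
INTERJECTIONS = frozenset({
    "ah", "bah", "beh", "ben", "chh", "eh", "euh", "ha", "hé", "hein",
    "hop", "hum", "m-hm", "mmh", "mm", "oh", "ouf", "pff", "youh",
})


def strip_interjections(transcript: str) -> str:
    """Remove whitespace-separated tokens that match a listed interjection."""
    kept = [
        token
        for token in transcript.split()
        if token.strip(".,;:!?").lower() not in INTERJECTIONS
    ]
    return " ".join(kept)


print(strip_interjections(
    "euh alors je vais commencer par des petites questions "
    "tout à fait simples n'est-ce pas"
))
# alors je vais commencer par des petites questions tout à fait simples n'est-ce pas
```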

Citation

If you use this model in your research, please cite:

@misc{frwhisper2025,
  title={FrWhisper: Fine-tuned Whisper for French Conversational Speech with Interjections},
  author={Hanno Müller and Annette Gerstenberg},
  year={2025}
}

Acknowledgments

The authors acknowledge the financial support by the German Federal Ministry of Research, Technology and Space (BMFTR) through the project «KI-Servicezentrum Berlin Brandenburg» (16IS22092).

Model Card Contact

For questions about this model, please open an issue in the model repository or contact [[email protected]].
