How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
Abstract
Source-aware metrics using ASR transcripts and back-translations improve speech-to-text evaluation by addressing alignment issues and incorporating source information.
Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation: it ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio: automatic speech recognition (ASR) transcripts and back-translations of the reference translation. We also introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when the word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
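To make the setup concrete, below is a minimal sketch of scoring an ST hypothesis with a source-aware neural MT metric, using an ASR transcript of the input audio as the synthetic source. The specific models (openai-whisper for ASR, Unbabel/wmt22-comet-da as the source-aware metric) are illustrative assumptions, not necessarily the paper's exact setup; a back-translation of the reference produced by an off-the-shelf MT model could be substituted for the ASR transcript.

```python
# Minimal sketch: score an ST hypothesis with a source-aware MT metric,
# using an ASR transcript as a synthetic textual source.
# Model choices (openai-whisper, Unbabel/wmt22-comet-da) are illustrative
# assumptions, not necessarily the paper's configuration.
import whisper
from comet import download_model, load_from_checkpoint

# Step 1: generate a textual proxy of the source audio via ASR.
asr = whisper.load_model("base")
synthetic_source = asr.transcribe("utterance.wav")["text"]

# Step 2: feed the synthetic source to a source-aware neural metric
# alongside the ST hypothesis and the reference translation.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": synthetic_source,          # synthetic source (ASR transcript)
    "mt": "translation hypothesis",   # ST system output
    "ref": "reference translation",   # human reference
}]
score = comet.predict(data, batch_size=8, gpus=0)
print(score.system_score)
```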
Community
We tackle how to use source-aware neural MT metrics for evaluating speech translation, even without transcripts. By generating synthetic sources (ASR transcripts or back-translations, BTs) and introducing a cross-lingual re-segmentation algorithm, we enable robust evaluation directly from speech. ASR sources excel when WER < 20%, while BTs remain a cheaper, solid fallback, paving the way for more accurate, source-informed ST evaluation.
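The paper's two-step cross-lingual re-segmentation algorithm is not spelled out on this page. As a rough illustration of the alignment-mismatch problem it addresses, the sketch below redistributes the words of a synthetic source across reference segment boundaries with a simple length-proportional heuristic; the actual algorithm is more principled, and this function is purely hypothetical.

```python
# Hypothetical sketch of re-segmenting a synthetic source to match the
# reference segmentation. The paper's two-step cross-lingual algorithm is
# more sophisticated; this length-proportional heuristic only illustrates
# the alignment mismatch between synthetic sources and references.
from typing import List

def resegment(source_segments: List[str], reference_segments: List[str]) -> List[str]:
    """Redistribute source words so the output has one segment per
    reference segment, sized proportionally to the reference lengths."""
    words = " ".join(source_segments).split()
    ref_lens = [len(r.split()) for r in reference_segments]
    total = sum(ref_lens) or 1
    out, start = [], 0
    for i, ref_len in enumerate(ref_lens):
        # The last segment takes the remainder to avoid rounding loss.
        if i == len(ref_lens) - 1:
            end = len(words)
        else:
            end = start + round(len(words) * ref_len / total)
        out.append(" ".join(words[start:end]))
        start = end
    return out

# Example: two ASR segments re-segmented against three reference segments.
print(resegment(
    ["hello there how are", "you today my friend"],
    ["Hallo du.", "Wie geht es dir heute?", "Mein Freund."],
))
```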
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages (2025)
- Extending Automatic Machine Translation Evaluation to Book-Length Documents (2025)
- Long-context Reference-based MT Quality Estimation (2025)
- Evaluating Language Translation Models by Playing Telephone (2025)
- Whisper-UT: A Unified Translation Framework for Speech and Text (2025)
- StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation (2025)
- Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend