---
license: apache-2.0
title: >-
  Dolphin: Efficient Audio-Visual Speech Separation with Discrete Lip Semantics
  and Multi-Scale Global-Local Attention
sdk: gradio
emoji: 🌊
colorFrom: blue
colorTo: blue
---
# Dolphin: Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Kai Li\*, Kejun Gao\*, Xiaolin Hu
Tsinghua University
Dolphin is an efficient audio-visual speech separation framework that leverages discrete lip semantics and global-local attention to achieve state-of-the-art performance with significantly reduced computational complexity.
## 🎯 Highlights

- Balanced Quality & Efficiency: Single-pass separator achieves state-of-the-art AVSS performance without iterative refinement.
- DP-LipCoder: Dual-path, vector-quantized video encoder produces discrete audio-aligned semantic tokens while staying lightweight.
- Global-Local Attention: TDANet-based separator augments each layer with coarse global self-attention and heat diffusion local attention.
- Edge-Friendly Deployment: Delivers >50% parameter reduction, >2.4× lower MACs, and >6× faster GPU inference versus IIANet.
## 🔥 News

- [2025-09-28] Code and pre-trained models are released! 📦
## 📄 Abstract
Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech in noisy acoustic environments, but most existing systems remain computationally heavy. Dolphin tackles this tension by combining a lightweight, dual-path video encoder with a single-pass global-local collaborative separator. The video pathway, DP-LipCoder, maps lip movements into discrete semantic tokens that remain tightly aligned with audio through vector quantization and distillation from AV-HuBERT. The audio separator builds upon TDANet and injects global-local attention (GLA) blocks: coarse-grained self-attention for long-range context and heat diffusion attention for denoising fine details. Across three public AVSS benchmarks, Dolphin not only outperforms the state-of-the-art IIANet on separation metrics but also delivers over 50% fewer parameters, more than 2.4× lower MACs, and over 6× faster GPU inference, making it practical for edge deployment.
## 🌟 Motivation
In real-world environments, target speech is often masked by background noise and interfering speakers. This phenomenon reflects the classic “cocktail party effect,” where listeners selectively attend to a single speaker within a noisy scene (Cherry, 1953). These challenges have spurred extensive research on speech separation.
Audio-only approaches tend to struggle in complex acoustic conditions, while the integration of synchronous visual cues offers greater robustness. Recent deep learning-based AVSS systems achieve strong performance, yet many rely on computationally intensive separators or heavy iterative refinement, limiting their practicality.
Beyond the separator itself, AVSS models frequently inherit high computational cost from their video encoders. Large-scale lip-reading backbones provide rich semantic alignment but bring prohibitive parameter counts. Compressing them often erodes lip semantics, whereas designing new lightweight encoders from scratch risks losing semantic fidelity and degrading separation quality. Building a video encoder that balances compactness with semantic alignment therefore remains a central challenge for AVSS.
## 🧠 Method Overview
To address these limitations, Dolphin introduces a novel AVSS pipeline centered on two components:
- DP-LipCoder: A dual-path, vector-quantized video encoder that separates compressed visual structure from audio-aligned semantics. By combining vector quantization with knowledge distillation from AV-HuBERT, it converts continuous lip motion into discrete semantic tokens without sacrificing representational capacity.
- Single-Pass GLA Separator: A lightweight TDANet-based audio separator that removes the need for iterative refinement. Each layer hosts a global-local attention block: coarse-grained self-attention captures long-range dependencies at low resolution, while heat diffusion attention smooths features across channels to suppress noise and retain detail.
Together, these components strike a balance between separation quality and computational efficiency, enabling deployment in resource-constrained scenarios.
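
To make the DP-LipCoder idea concrete, the PyTorch snippet below sketches generic vector quantization: continuous lip features are snapped to their nearest codebook entry, producing discrete token ids, with a straight-through estimator so the encoder stays trainable. The class name, codebook size, and feature dimension are illustrative assumptions; this is not the released DP-LipCoder.

```python
import torch
import torch.nn as nn


class LipTokenQuantizer(nn.Module):
    """Toy vector quantizer: snaps continuous lip features to the nearest
    codebook entry, returning discrete token ids plus quantized features."""

    def __init__(self, num_tokens: int = 1024, dim: int = 256):  # sizes are assumptions
        super().__init__()
        self.codebook = nn.Embedding(num_tokens, dim)

    def forward(self, lip_feats: torch.Tensor):
        # lip_feats: (batch, frames, dim) continuous features from a video backbone
        flat = lip_feats.reshape(-1, lip_feats.size(-1))              # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)               # (B*T, num_tokens)
        token_ids = dists.argmin(dim=-1).view(lip_feats.shape[:-1])   # discrete lip tokens (B, T)
        quantized = self.codebook(token_ids)                          # (B, T, dim)
        # Straight-through estimator: forward pass uses codebook vectors,
        # backward pass copies gradients to the encoder features.
        quantized = lip_feats + (quantized - lip_feats).detach()
        return quantized, token_ids


if __name__ == "__main__":
    vq = LipTokenQuantizer()
    feats = torch.randn(2, 50, 256)          # 2 clips, 50 lip frames each
    q, ids = vq(feats)
    print(q.shape, ids.shape)                # torch.Size([2, 50, 256]) torch.Size([2, 50])
```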
## 🧪 Experimental Highlights
We evaluate Dolphin on LRS2, LRS3, and VoxCeleb2. Compared with the state-of-the-art IIANet, Dolphin achieves higher scores across all separation metrics while dramatically reducing resource consumption:
- Parameters: >50% reduction
- Computation: >2.4× decrease in MACs
- Inference: >6× speedup on GPU
These results demonstrate that Dolphin provides competitive AVSS quality on edge hardware without heavy iterative processing.
## 🏗️ Architecture

*The overall architecture of Dolphin.*

### Video Encoder

*The video encoder of Dolphin.*

### Dolphin Model Overview

*The overall architecture of Dolphin's separator.*

### Key Components

#### Global Attention (GA) Block
- Applies coarse-grained self-attention to capture long-range structure
- Operates at low spatial resolution for efficiency
- Enhances robustness to complex acoustic mixtures
#### Local Attention (LA) Block
- Uses heat diffusion attention to smooth features across channels
- Suppresses background noise while preserving details
- Complements GA to balance global context and local fidelity (see the sketch below)
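
The sketch below shows how a global-local attention block of this kind can be wired up: coarse self-attention over a pooled sequence supplies long-range context, while a lightweight local branch smooths fine detail. The depthwise convolution only stands in for heat diffusion attention, and all layer sizes are assumptions; this is not the released Dolphin block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttentionSketch(nn.Module):
    """Coarse global self-attention over a downsampled sequence plus a cheap
    local smoothing branch, fused residually with the input features."""

    def __init__(self, channels: int = 128, num_heads: int = 4, pool: int = 4):
        super().__init__()
        self.pool = pool
        self.global_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Depthwise conv as an illustrative local smoothing operator.
        self.local_smooth = nn.Conv1d(channels, channels, kernel_size=5,
                                      padding=2, groups=channels)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) audio feature map from the separator backbone
        B, C, T = x.shape
        # Global branch: attend at a coarse temporal resolution for long-range context.
        coarse = F.avg_pool1d(x, kernel_size=self.pool)           # (B, C, T // pool)
        tokens = coarse.transpose(1, 2)                           # (B, T // pool, C)
        ctx, _ = self.global_attn(tokens, tokens, tokens)
        ctx = F.interpolate(ctx.transpose(1, 2), size=T)          # back to (B, C, T)
        # Local branch: smooth fine-grained detail to suppress noise.
        local = self.local_smooth(x)
        # Residual fusion of global context and local detail, then normalize.
        out = x + ctx + local
        return self.norm(out.transpose(1, 2)).transpose(1, 2)


if __name__ == "__main__":
    block = GlobalLocalAttentionSketch()
    y = block(torch.randn(2, 128, 1600))
    print(y.shape)                           # torch.Size([2, 128, 1600])
```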
## 📊 Results

### Performance Comparison

*Performance metrics on three public AVSS benchmark datasets. Bold indicates best performance.*

### Efficiency Analysis
Dolphin achieves:
- ✅ >50% parameter reduction
- ✅ >2.4× lower computational cost (MACs)
- ✅ >6× faster GPU inference speed
- ✅ Superior separation quality across all metrics
## 📦 Installation

```bash
git clone https://github.com/JusperLee/Dolphin.git
cd Dolphin
pip install torch torchvision
pip install -r requirements.txt
```
### Requirements
- Python >= 3.10
- PyTorch >= 2.5.0
- CUDA >= 12.4
- Other dependencies in requirements.txt (a quick environment check is sketched below)
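
A minimal sanity check against these requirements; the version thresholds in the comments come from the list above, and the check itself is just standard PyTorch introspection.

```python
import sys
import torch

print("Python :", sys.version.split()[0])   # expect >= 3.10
print("PyTorch:", torch.__version__)        # expect >= 2.5.0
print("CUDA   :", torch.version.cuda)       # expect >= 12.4 (None on CPU-only builds)
print("GPU    :", torch.cuda.is_available())
```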
## 🚀 Quick Start

### Inference with Pre-trained Model
```bash
# Single audio-visual separation
python inference.py \
    --input /path/to/video.mp4 \
    --output /path/to/output/directory \
    --speakers 2 \
    --detect-every-n 8 \
    --face-scale 1.5 \
    --cuda-device 0 \
    --config checkpoints/vox2/conf.yml
```
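
If you prefer to drive inference from Python, the snippet below is a hypothetical starting point that loads the released config and a checkpoint by hand. Only `checkpoints/vox2/conf.yml` appears above; the checkpoint filename `best_model.pth` and the `state_dict` key are assumptions, so check the repository for the actual entry points.

```python
# Hypothetical programmatic alternative to inference.py (requires PyYAML).
import torch
import yaml

with open("checkpoints/vox2/conf.yml") as f:
    conf = yaml.safe_load(f)                      # model / inference configuration

# "best_model.pth" is an assumed filename; replace with the released checkpoint.
ckpt = torch.load("checkpoints/vox2/best_model.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

print("config sections  :", list(conf.keys()))
print("checkpoint tensors:", len(state))
```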
## 📁 Model Zoo
| Model | Training Data | SI-SNRi | PESQ | Download |
|---|---|---|---|---|
| Dolphin | VoxCeleb2 | 16.1 dB | 3.45 | Link |
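
For reference, SI-SNRi in the table is the improvement in scale-invariant SNR of the separated signal over the unprocessed mixture, both measured against the clean target. A minimal implementation of this standard definition (not taken from the Dolphin codebase) is sketched below.

```python
import torch


def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (..., time) waveforms (standard definition)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference; the residual counts as noise.
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


def si_snr_improvement(est, mix, ref):
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```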
## 📝 Citation
If you find Dolphin useful in your research, please cite:
```bibtex
@misc{li2025efficientaudiovisualspeechseparation,
      title={Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention},
      author={Kai Li and Kejun Gao and Xiaolin Hu},
      year={2025},
      eprint={2509.23610},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.23610},
}
```
## 🤝 Acknowledgments
We thank the authors of IIANet and SepReformer for providing parts of the code used in this project.
## 📧 Contact
For questions and feedback, please open an issue on GitHub or contact us at: [email protected]
## 📜 License
This project is licensed under the Apache License 2.0 (as declared in the Space metadata above) - see the LICENSE file for details.
Made with stars ⭐️ for efficient audio-visual speech separation





