---
license: apache-2.0
title: >-
  Dolphin: Efficient Audio-Visual Speech Separation with Discrete Lip Semantics
  and Multi-Scale Global-Local Attention
sdk: gradio
emoji: 👀
colorFrom: blue
colorTo: blue
---

Dolphin Logo

Dolphin: Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Kai Li*, Kejun Gao*, Xiaolin Hu
Tsinghua University

Visitor Count · GitHub Stars · arXiv Paper · Hugging Face Models · Gradio Live Demo

Dolphin is an efficient audio-visual speech separation framework that leverages discrete lip semantics and global–local attention to achieve state-of-the-art performance with significantly reduced computational complexity.

🎯 Highlights

  • Balanced Quality & Efficiency: Single-pass separator achieves state-of-the-art AVSS performance without iterative refinement.
  • DP-LipCoder: Dual-path, vector-quantized video encoder produces discrete audio-aligned semantic tokens while staying lightweight.
  • Global–Local Attention: TDANet-based separator augments each layer with coarse global self-attention and heat diffusion local attention.
  • Edge-Friendly Deployment: Delivers >50% parameter reduction, >2.4× lower MACs, and >6× faster GPU inference versus IIANet.

💥 News

  • [2025-09-28] Code and pre-trained models are released! 📦

📜 Abstract

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech in noisy acoustic environments, but most existing systems remain computationally heavy. Dolphin tackles this tension by combining a lightweight, dual-path video encoder with a single-pass global–local collaborative separator. The video pathway, DP-LipCoder, maps lip movements into discrete semantic tokens that remain tightly aligned with audio through vector quantization and distillation from AV-HuBERT. The audio separator builds upon TDANet and injects global–local attention (GLA) blocks: coarse-grained self-attention for long-range context and heat diffusion attention for denoising fine details. Across three public AVSS benchmarks, Dolphin not only outperforms the state-of-the-art IIANet on separation metrics but also delivers over 50% fewer parameters, more than 2.4× lower MACs, and over 6× faster GPU inference, making it practical for edge deployment.

🌍 Motivation

In real-world environments, target speech is often masked by background noise and interfering speakers. This phenomenon reflects the classic "cocktail party effect," where listeners selectively attend to a single speaker within a noisy scene (Cherry, 1953). These challenges have spurred extensive research on speech separation.

Audio-only approaches tend to struggle in complex acoustic conditions, while the integration of synchronous visual cues offers greater robustness. Recent deep learning-based AVSS systems achieve strong performance, yet many rely on computationally intensive separators or heavy iterative refinement, limiting their practicality.

Beyond the separator itself, AVSS models frequently inherit high computational cost from their video encoders. Large-scale lip-reading backbones provide rich semantic alignment but bring prohibitive parameter counts. Compressing them often erodes lip semantics, whereas designing new lightweight encoders from scratch risks losing semantic fidelity and degrading separation quality. Building a video encoder that balances compactness with semantic alignment therefore remains a central challenge for AVSS.

🧠 Method Overview

To address these limitations, Dolphin introduces a novel AVSS pipeline centered on two components:

  • DP-LipCoder: A dual-path, vector-quantized video encoder that separates compressed visual structure from audio-aligned semantics. By combining vector quantization with knowledge distillation from AV-HuBERT, it converts continuous lip motion into discrete semantic tokens without sacrificing representational capacity.
  • Single-Pass GLA Separator: A lightweight TDANet-based audio separator that removes the need for iterative refinement. Each layer hosts a global–local attention block: coarse-grained self-attention captures long-range dependencies at low resolution, while heat diffusion attention smooths features across channels to suppress noise and retain detail.

Together, these components strike a balance between separation quality and computational efficiency, enabling deployment in resource-constrained scenarios.
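
To make the DP-LipCoder idea more concrete, the toy sketch below shows a standard vector-quantization step that turns continuous lip features into discrete tokens. It is only an illustration of the mechanism described above, not the actual DP-LipCoder code: the feature dimension, codebook size, and loss weighting are assumptions, and the AV-HuBERT distillation that aligns the tokens with audio is omitted.

```python
# Illustrative VQ-VAE-style quantization, standing in for the discretization
# inside DP-LipCoder. All sizes below are assumptions, not Dolphin's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # discrete "lip semantic" codes
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous lip features from a video backbone
        b, t, d = z.shape
        flat = z.reshape(-1, d)
        dist = torch.cdist(flat, self.codebook.weight)    # (B*T, num_codes)
        indices = dist.argmin(dim=-1).view(b, t)          # discrete token ids
        z_q = self.codebook(indices)                      # quantized features
        # codebook + commitment losses, as in standard VQ-VAE training
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, indices, loss


if __name__ == "__main__":
    vq = VectorQuantizer()
    lip_feats = torch.randn(2, 50, 256)      # 2 clips, 50 video frames each
    tokens, ids, vq_loss = vq(lip_feats)
    print(tokens.shape, ids.shape, vq_loss.item())
```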

🧪 Experimental Highlights

We evaluate Dolphin on LRS2, LRS3, and VoxCeleb2. Compared with the state-of-the-art IIANet, Dolphin achieves higher scores across all separation metrics while dramatically reducing resource consumption:

  • Parameters: >50% reduction
  • Computation: >2.4× decrease in MACs
  • Inference: >6× speedup on GPU

These results demonstrate that Dolphin provides competitive AVSS quality on edge hardware without heavy iterative processing.

๐Ÿ—๏ธ Architecture

Dolphin Architecture

The overall architecture of Dolphin.

Video Encoder

Dolphin Architecture

The video encoder of Dolphin.

Dolphin Model Overview

Dolphin Architecture

The overall architecture of Dolphin's separator.

Key Components

Dolphin Architecture

  1. Global Attention (GA) Block

    • Applies coarse-grained self-attention to capture long-range structure
    • Operates at low spatial resolution for efficiency
    • Enhances robustness to complex acoustic mixtures
  2. Local Attention (LA) Block

    • Uses heat diffusion attention to smooth features across channels
    • Suppresses background noise while preserving details
    • Complements GA to balance global context and local fidelity
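
As a rough illustration of how the GA and LA blocks described above fit together, here is a minimal sketch: self-attention over a downsampled sequence for global context, plus a lightweight channel-wise smoothing branch standing in for heat diffusion attention. The shapes, pooling factor, and local branch are assumptions, not the paper's exact design.

```python
# Minimal global-local attention (GLA) layer sketch, assuming a (B, C, T)
# audio feature map. The depthwise conv is only a simple stand-in for the
# heat diffusion local attention used in Dolphin.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttention(nn.Module):
    def __init__(self, channels: int = 128, num_heads: int = 4, pool: int = 4):
        super().__init__()
        self.pool = pool
        self.global_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.local_smooth = nn.Conv1d(channels, channels, kernel_size=5,
                                      padding=2, groups=channels)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t = x.shape
        # Global branch: coarse self-attention on a downsampled sequence.
        coarse = F.avg_pool1d(x, self.pool)                   # (B, C, T/pool)
        q = coarse.transpose(1, 2)                            # (B, T/pool, C)
        g, _ = self.global_attn(q, q, q)
        g = F.interpolate(g.transpose(1, 2), size=t, mode="nearest")
        # Local branch: channel-wise smoothing to suppress noise.
        l = self.local_smooth(x)
        out = x + g + l                                       # residual fusion
        return self.norm(out.transpose(1, 2)).transpose(1, 2)


if __name__ == "__main__":
    layer = GlobalLocalAttention()
    feats = torch.randn(2, 128, 1600)   # toy audio feature map
    print(layer(feats).shape)           # torch.Size([2, 128, 1600])
```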

📊 Results

Performance Comparison

Performance metrics on three public AVSS benchmark datasets. Bold indicates best performance.

Results Table

Efficiency Analysis

Efficiency Comparison

Dolphin achieves:

  • ✅ >50% parameter reduction
  • ✅ >2.4× lower computational cost (MACs)
  • ✅ >6× faster GPU inference speed
  • ✅ Superior separation quality across all metrics
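
To reproduce this kind of efficiency accounting in your own setup, parameter counts and MACs are typically measured with a profiler such as `thop` (an optional third-party tool, not a project dependency). The model below is a stand-in, since this README does not expose Dolphin's model classes.

```python
# Counting parameters and MACs with thop on a toy stand-in model.
import torch
import torch.nn as nn
from thop import profile  # pip install thop

model = nn.Sequential(                        # placeholder for a separator network
    nn.Conv1d(1, 64, kernel_size=16, stride=8),
    nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=3, padding=1),
)
dummy_mixture = torch.randn(1, 1, 16000)      # one second of 16 kHz audio
macs, params = profile(model, inputs=(dummy_mixture,))
print(f"Params: {params / 1e6:.2f} M | MACs: {macs / 1e9:.3f} G")
```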

📦 Installation

```bash
git clone https://github.com/JusperLee/Dolphin.git
cd Dolphin
pip install torch torchvision
pip install -r requirements.txt
```

Requirements

  • Python >= 3.10
  • PyTorch >= 2.5.0
  • CUDA >= 12.4
  • Other dependencies in requirements.txt
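
A quick way to confirm the environment matches these requirements before running inference (standard PyTorch introspection only):

```python
# Sanity-check the installed Python / PyTorch / CUDA setup against the versions above.
import sys
import torch

print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```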

🚀 Quick Start

Inference with Pre-trained Model

```bash
# Single audio-visual separation
python inference.py \
    --input /path/to/video.mp4 \
    --output /path/to/output/directory \
    --speakers 2 \
    --detect-every-n 8 \
    --face-scale 1.5 \
    --cuda-device 0 \
    --config checkpoints/vox2/conf.yml
```
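
To run several clips through the same pipeline, a small wrapper can loop over files. Only the `inference.py` flags shown above come from this README; the file list and output layout below are made-up examples.

```python
# Batch multiple videos through the documented inference.py CLI.
import subprocess
from pathlib import Path

videos = ["clips/meeting.mp4", "clips/interview.mp4"]   # hypothetical input files
for video in videos:
    out_dir = Path("separated") / Path(video).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "python", "inference.py",
            "--input", video,
            "--output", str(out_dir),
            "--speakers", "2",
            "--detect-every-n", "8",
            "--face-scale", "1.5",
            "--cuda-device", "0",
            "--config", "checkpoints/vox2/conf.yml",
        ],
        check=True,
    )
```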

📝 Model Zoo

| Model   | Training Data | SI-SNRi | PESQ | Download |
|---------|---------------|---------|------|----------|
| Dolphin | VoxCeleb2     | 16.1 dB | 3.45 | Link     |
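
If the checkpoint is hosted on the Hugging Face Hub, `huggingface_hub` can fetch it programmatically. The repository id below is a placeholder; substitute the repository behind the Download link above, or download the files manually.

```python
# Fetch pretrained checkpoint files from the Hugging Face Hub.
from huggingface_hub import snapshot_download

# NOTE: placeholder repo id -- use the repository linked in the table above.
ckpt_dir = snapshot_download(repo_id="JusperLee/Dolphin")
print("Checkpoint files downloaded to:", ckpt_dir)
```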

📖 Citation

If you find Dolphin useful in your research, please cite:

```bibtex
@misc{li2025efficientaudiovisualspeechseparation,
      title={Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention},
      author={Kai Li and Kejun Gao and Xiaolin Hu},
      year={2025},
      eprint={2509.23610},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.23610},
}
```

🤝 Acknowledgments

We thank the authors of IIANet and SepReformer for providing parts of the code used in this project.

📧 Contact

For questions and feedback, please open an issue on GitHub or contact us at: [email protected]

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Made with stars ⭐️ for efficient audio-visual speech separation