UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Abstract
UniAVGen is a unified audio-video generation framework that pairs two parallel Diffusion Transformers with Asymmetric Cross-Modal Interaction, improving synchronization and semantic consistency while requiring far fewer training samples.
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
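To make the interaction mechanism more concrete, below is a minimal PyTorch sketch of bidirectional, temporally aligned cross-attention between parallel audio and video token streams, in the spirit of the Asymmetric Cross-Modal Interaction described above. The module names, the frame-window alignment mask, and all hyperparameters are illustrative assumptions, not the paper's implementation (which has not been released yet).

```python
# Hypothetical sketch: bidirectional, temporally aligned cross-attention
# between an audio token stream and a video token stream. The masking rule
# and all names are assumptions for illustration only.
import torch
import torch.nn as nn


def temporal_alignment_mask(n_frames, tokens_per_frame, n_audio_tokens, window=1):
    """Boolean (video_len x audio_len) mask, True where attention is allowed.
    Each video frame's tokens may only attend to audio tokens that fall
    within +/- `window` frames of that frame (assumed alignment rule)."""
    video_len = n_frames * tokens_per_frame
    mask = torch.zeros(video_len, n_audio_tokens, dtype=torch.bool)
    audio_per_frame = n_audio_tokens / n_frames
    for f in range(n_frames):
        lo = max(0, int((f - window) * audio_per_frame))
        hi = min(n_audio_tokens, int(round((f + 1 + window) * audio_per_frame)))
        mask[f * tokens_per_frame:(f + 1) * tokens_per_frame, lo:hi] = True
    return mask


class BidirectionalCrossAttention(nn.Module):
    """One interaction block: video queries attend to audio keys/values and
    audio queries attend to video keys/values, both restricted by the mask."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens, align_mask):
        # PyTorch convention: in a boolean attn_mask, True marks positions
        # that are NOT allowed to attend, hence the negation below.
        v_update, _ = self.video_to_audio(video_tokens, audio_tokens, audio_tokens,
                                          attn_mask=~align_mask)
        a_update, _ = self.audio_to_video(audio_tokens, video_tokens, video_tokens,
                                          attn_mask=~align_mask.T)
        return video_tokens + v_update, audio_tokens + a_update


# Toy shapes: 8 video frames x 16 tokens each, 128 audio tokens, dim 64.
block = BidirectionalCrossAttention(dim=64)
mask = temporal_alignment_mask(n_frames=8, tokens_per_frame=16, n_audio_tokens=128)
video = torch.randn(2, 8 * 16, 64)
audio = torch.randn(2, 128, 64)
video, audio = block(video, audio, mask)
```

In a full model, one such block would sit inside each DiT layer pair so that the two branches exchange information at matching depths; that layer placement is likewise an assumption here.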
Community
UniAVGen is a unified framework for high-fidelity joint audio-video generation, addressing key limitations of existing methods such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.
At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three critical innovations:
(1) Asymmetric Cross-Modal Interaction for bidirectional temporal alignment,
(2) Face-Aware Modulation to prioritize salient facial regions during interaction,
(3) Modality-Aware Classifier-Free Guidance to amplify cross-modal correlations during inference (see the sketch after this list).
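As a rough, speculative illustration of point (3), the sketch below shows one way a modality-aware guidance rule could combine an unconditional pass, a text-only pass, and a fully conditioned pass so that the cross-modal term receives its own amplification. The `denoiser` signature, the guidance weights, and the three-pass decomposition are assumptions, not the paper's published formulation.

```python
# Rough sketch of a modality-aware classifier-free guidance rule.
# The signature `denoiser(x_t, t, text, cross)` and the weights are
# illustrative assumptions.
import torch


def modality_aware_cfg(denoiser, x_t, t, text_cond, cross_tokens,
                       w_text=5.0, w_cross=2.0):
    """Blend three denoiser passes so the other modality's contribution
    gets its own, explicitly amplified guidance term."""
    eps_uncond = denoiser(x_t, t, text=None, cross=None)             # fully unconditional
    eps_text = denoiser(x_t, t, text=text_cond, cross=None)          # text prompt only
    eps_full = denoiser(x_t, t, text=text_cond, cross=cross_tokens)  # text + other modality
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)    # ordinary classifier-free guidance
            + w_cross * (eps_full - eps_text))    # amplified cross-modal correlation term


# Toy check with a dummy denoiser so the function runs end to end.
def dummy_denoiser(x, t, text=None, cross=None):
    scale = 0.5 + 0.1 * (text is not None) + 0.1 * (cross is not None)
    return scale * x


x_t = torch.randn(2, 128, 64)
guided = modality_aware_cfg(dummy_denoiser, x_t, t=torch.tensor(500),
                            text_cond="a person speaking", cross_tokens=x_t)
```

In a joint audio-video setting the same rule would presumably be applied in both branches, with each branch treating the other's tokens as the cross-modal condition.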
Project Page: https://mcg-nju.github.io/UniAVGen/
The code and checkpoints will be released soon.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation (2025)
- HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning (2025)
- Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation (2025)
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration (2025)
- VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework (2025)
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction (2025)
- Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video (2025)