Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Abstract
UniSandbox evaluates Unified Multimodal Models, revealing a gap between understanding and generation, and identifies Chain-of-Thought and self-training as means to bridge this gap.
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox
Community
UniSandbox is live!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- GIR-Bench: Versatile Benchmark for Generating Images with Reasoning (2025)
- SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models (2025)
- Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval (2025)
- RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark (2025)
- From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models (2025)
- Generative Universal Verifier as Multimodal Meta-Reasoner (2025)
- Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark (2025)