ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Abstract
ThinkMorph, a unified model fine-tuned on interleaved reasoning traces, enhances multimodal reasoning by generating coherent text-image steps, achieving significant performance gains and demonstrating emergent capabilities.
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
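The abstract attributes part of the gains to test-time scaling over diversified multimodal thoughts. As a rough illustration of one standard way to exploit such diversity (self-consistency-style voting), the sketch below samples several interleaved traces and majority-votes their final answers; the `sample_trace` callable is a hypothetical stand-in for drawing one trace from the model, not ThinkMorph's actual API.

```python
from collections import Counter
from typing import Callable
import random

def scale_at_test_time(sample_trace: Callable[[], str], n_samples: int = 8) -> str:
    """Sample several diverse interleaved traces and majority-vote their final answers."""
    answers = [sample_trace() for _ in range(n_samples)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Toy usage: a stub sampler stands in for drawing one full multimodal trace
# from the model and reading off its final answer.
print(scale_at_test_time(lambda: random.choice(["A", "A", "B"])))
```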
Community
The four tasks we interleave:
- 🧩 Jigsaw Assembly → rearrange patches with visual verification
- 🗺️ Spatial Navigation → overlay and validate routes
- 🔍 Visual Search → draw precise boxes to ground answers
- 📊 Chart Refocus → highlight regions, then compute
Result: +86.67% on Spatial Navigation, +38.75% on Jigsaw Assembly, and +34.74% on average over the base model.
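To make the interleaving concrete, here is a minimal schematic of the decode loop such a unified model follows, alternating verbal steps with concrete visual manipulations (rearranging patches, overlaying routes, drawing boxes, highlighting chart regions). The `model.step` interface, the `Thought` container, and the stub model are illustrative assumptions, not ThinkMorph's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Thought:
    modality: str    # "text" or "image"
    content: object  # a reasoning sentence or an edited image

def interleaved_reason(model, question, image, max_steps=8):
    """Alternate text thoughts and image thoughts until a final answer appears."""
    trace, current_image, text = [], image, ""
    for _ in range(max_steps):
        # Verbal step: reason over the question and the latest visual state.
        text = model.step(question, current_image, trace, modality="text")
        trace.append(Thought("text", text))
        if text.startswith("Final answer:"):
            break
        # Visual step: manipulate the image (e.g. draw a box, overlay a route).
        current_image = model.step(question, current_image, trace, modality="image")
        trace.append(Thought("image", current_image))
    return text, trace

class _StubModel:
    """Trivial stand-in so the sketch runs end to end."""
    def __init__(self):
        self.calls = 0
    def step(self, question, image, trace, modality):
        self.calls += 1
        if modality == "text":
            return "Final answer: B" if self.calls > 2 else "Zoom in on the top-left patch."
        return image  # a real model would return an edited image here

print(interleaved_reason(_StubModel(), "Which patch goes first?", image="<pixels>"))
```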
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning (2025)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models (2025)
- Latent Visual Reasoning (2025)
- GIR-Bench: Versatile Benchmark for Generating Images with Reasoning (2025)
- MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning (2025)
- More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models (2025)
- BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception (2025)
Models citing this paper 1
Datasets citing this paper 4
Spaces citing this paper 0


