SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
Abstract
SAIL-RL enhances the reasoning capabilities of multimodal large language models through a dual-reward system, improving benchmark performance and reducing hallucinations.
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
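As a rough illustration of the dual-reward idea described in the abstract, the sketch below combines a reasoning-quality score (factual grounding, logical coherence, answer consistency) with a think-vs-answer decision score into a single scalar reward. The function names, the equal linear weighting, and the binary judging criterion are assumptions for illustration only; they are not the paper's exact formulation.

```python
# Minimal sketch of how a dual-reward signal could be combined for RL post-training.
# All scoring functions and weights below are illustrative assumptions, not SAIL-RL's
# actual implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    thinking: float = 0.5  # weight on the reasoning-quality reward (assumed value)
    judging: float = 0.5   # weight on the think-vs-answer decision reward (assumed value)


def thinking_reward(grounding: float, coherence: float, consistency: float) -> float:
    """Score reasoning quality from factual grounding, logical coherence,
    and answer consistency, each assumed to lie in [0, 1]."""
    return (grounding + coherence + consistency) / 3.0


def judging_reward(used_deep_reasoning: bool, task_needs_reasoning: bool) -> float:
    """Reward the model for choosing deep reasoning only when the task calls for it,
    penalizing both overthinking on easy tasks and underthinking on hard ones."""
    return 1.0 if used_deep_reasoning == task_needs_reasoning else 0.0


def total_reward(grounding: float, coherence: float, consistency: float,
                 used_deep_reasoning: bool, task_needs_reasoning: bool,
                 w: RewardWeights = RewardWeights()) -> float:
    """Combine the two rewards into the scalar signal fed to the RL optimizer."""
    return (w.thinking * thinking_reward(grounding, coherence, consistency)
            + w.judging * judging_reward(used_deep_reasoning, task_needs_reasoning))


if __name__ == "__main__":
    # Example: the model correctly answers a simple question directly,
    # with well-grounded and consistent output.
    r = total_reward(0.9, 0.8, 1.0,
                     used_deep_reasoning=False, task_needs_reasoning=False)
    print(f"total reward: {r:.3f}")
```

In practice this scalar would feed a policy-gradient-style optimizer during post-training; the weighting scheme and how reasoning quality is actually judged are placeholders here.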
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models (2025)
- AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning (2025)
- Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment (2025)
- Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning (2025)
- Parallel-R1: Towards Parallel Thinking via Reinforcement Learning (2025)
- More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models (2025)
- From Faithfulness to Correctness: Generative Reward Models that Think Critically (2025)