Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark Paper • 2511.13853 • Published 2 days ago • 32
ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents Paper • 2511.07685 • Published 9 days ago • 6
PAN: A World Model for General, Interactable, and Long-Horizon World Simulation Paper • 2511.09057 • Published 8 days ago • 66
Adaptive Multi-Agent Response Refinement in Conversational Systems Paper • 2511.08319 • Published 8 days ago • 39
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale Paper • 2511.05705 • Published 12 days ago • 6
The Station: An Open-World Environment for AI-Driven Discovery Paper • 2511.06309 • Published 11 days ago • 34
IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction Paper • 2511.07327 • Published 9 days ago • 70
CodeClash: Benchmarking Goal-Oriented Software Engineering Paper • 2511.00839 • Published 18 days ago • 8
LTD-Bench: Evaluating Large Language Models by Letting Them Draw Paper • 2511.02347 • Published 16 days ago • 8
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats Paper • 2510.25602 • Published 21 days ago • 71
π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models Paper • 2510.25889 • Published 21 days ago • 62
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark Paper • 2510.26160 • Published 21 days ago • 15
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence Paper • 2510.23538 • Published 23 days ago • 95
Reasoning with Sampling: Your Base Model is Smarter Than You Think Paper • 2510.14901 • Published Oct 16 • 47
Unified Reinforcement and Imitation Learning for Vision-Language Models Paper • 2510.19307 • Published 29 days ago • 27
A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning Paper • 2510.15444 • Published Oct 17 • 145