MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs • arXiv 2511.07250 • Published Nov 2025
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation • arXiv 2510.24821 • Published Oct 2025
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues • arXiv 2510.17722 • Published Oct 20, 2025
IF-VidCap: Can Video Caption Models Follow Instructions? • arXiv 2510.18726 • Published Oct 2025
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs • arXiv 2510.18876 • Published Oct 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM • arXiv 2510.15870 • Published Oct 17, 2025
COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes • arXiv 2510.14763 • Published Oct 16, 2025
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning • arXiv 2510.10518 • Published Oct 12, 2025
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions • arXiv 2510.10666 • Published Oct 12, 2025
ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems • arXiv 2510.11652 • Published Oct 13, 2025
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding • arXiv 2510.11498 • Published Oct 13, 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs • arXiv 2510.10689 • Published Oct 12, 2025
StreamingVLM: Real-Time Understanding for Infinite Video Streams • arXiv 2510.09608 • Published Oct 10, 2025
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? • arXiv 2510.08189 • Published Oct 9, 2025
UniVideo: Unified Understanding, Generation, and Editing for Videos • arXiv 2510.08377 • Published Oct 9, 2025
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution • arXiv 2509.25301 • Published Sep 29, 2025