MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs Paper • 2511.07250 • Published 9 days ago • 17
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation Paper • 2510.24821 • Published 22 days ago • 34
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues Paper • 2510.17722 • Published about 1 month ago • 19
IF-VidCap: Can Video Caption Models Follow Instructions? Paper • 2510.18726 • Published 29 days ago • 24
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs Paper • 2510.18876 • Published 29 days ago • 35
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM Paper • 2510.15870 • Published Oct 17 • 87
COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes Paper • 2510.14763 • Published Oct 16 • 13
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning Paper • 2510.10518 • Published Oct 12 • 17
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions Paper • 2510.10666 • Published Oct 12 • 27
ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems Paper • 2510.11652 • Published Oct 13 • 28
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding Paper • 2510.11498 • Published Oct 13 • 10