Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Paper • 2511.04570 • Published 13 days ago • 191
RoboOmni: Proactive Robot Manipulation in Omni-modal Context Paper • 2510.23763 • Published 23 days ago • 53
EO-Robotics Collection • EmbodiedOneVision is a unified framework for multimodal embodied reasoning and robot control, featuring interleaved vision-text-action pretraining. • 5 items • Updated Sep 16 • 8
Unicorn: Text-Only Data Synthesis for Vision Language Model Training Paper • 2503.22655 • Published Mar 28 • 39
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation Paper • 2503.06053 • Published Mar 8 • 138
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models Paper • 2407.05131 • Published Jul 6, 2024 • 27