SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding • Paper • 2511.04668 • Published 2 days ago • 3
RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies • Paper • 2510.17950 • Published 19 days ago • 4
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization • Paper • 2510.25616 • Published 10 days ago • 88
TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System • Paper • 2511.02832 • Published 4 days ago • 8
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process • Paper • 2511.01718 • Published 5 days ago • 6
MotionStream: Real-Time Video Generation with Interactive Motion Controls • Paper • 2511.01266 • Published 5 days ago • 20
NaviTrace: Evaluating Embodied Navigation of Vision-Language Models • Paper • 2510.26909 • Published 9 days ago • 13
World Simulation with Video Foundation Models for Physical AI • Paper • 2511.00062 • Published 11 days ago • 35
PHUMA: Physically-Grounded Humanoid Locomotion Dataset • Paper • 2510.26236 • Published 9 days ago • 27
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning • Paper • 2510.27606 • Published 8 days ago • 27
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model • Paper • 2510.27607 • Published 8 days ago • 8
π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models • Paper • 2510.25889 • Published 10 days ago • 58
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats • Paper • 2510.25602 • Published 10 days ago • 63
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model • Paper • 2510.24992 • Published 11 days ago • 2
CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning • Paper • 2510.22282 • Published 14 days ago • 2