World Simulation with Video Foundation Models for Physical AI Paper • 2511.00062 • Published 21 days ago
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process Paper • 2511.01718 • Published 15 days ago
NaviTrace: Evaluating Embodied Navigation of Vision-Language Models Paper • 2510.26909 • Published 19 days ago
Do You Need Proprioceptive States in Visuomotor Policies? Paper • 2509.18644 • Published Sep 23
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19
FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies Paper • 2509.04996 • Published Sep 5
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models Paper • 2506.07961 • Published Jun 9
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Paper • 2505.23747 • Published May 29
Emerging Properties in Unified Multimodal Pretraining Paper • 2505.14683 • Published May 20
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model Paper • 2504.08685 • Published Apr 11
FLOWER VLA Collection • Checkpoints for the FLOWER VLA policy, a small and versatile VLA for language-conditioned robot manipulation with fewer than 1B parameters • 10 items • Updated Sep 17
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k Paper • 2503.09642 • Published Mar 12
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Paper • 2502.11089 • Published Feb 16
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation Paper • 2502.12148 • Published Feb 17