From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published Oct 16 • 65
Learning GUI Grounding with Spatial Reasoning from Visual Feedback Paper • 2509.21552 • Published Sep 25 • 11
Sample-efficient Integration of New Modalities into Large Language Models Paper • 2509.04606 • Published Sep 4 • 8
Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models Paper • 2506.06006 • Published Jun 6 • 13
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity Paper • 2502.13063 • Published Feb 18 • 72