Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Abstract
A systematic study reveals that naive action fine-tuning degrades visual representations in Vision-Language-Action models, but targeted alignment strategies can mitigate this degradation and improve generalization.
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
Community
arXiv: https://arxiv.org/abs/2510.25616
Project Page: https://blind-vla-paper.github.io/
Code: https://github.com/CognitiveAISystems/BlindVLA
Action fine-tuning often blinds VLA models: they lose the vision-language (VL) priors that made them smart. We show how to keep those priors intact with a tiny alignment loss. 👀🤖
🔍 We probed the VL representations of OpenVLA and observed three major problems: (1) attention sink: attention maps become diffuse, noisy, and weakly correlated with the target object; (2) representation collapse: patch embeddings lose separability; (3) domain forgetting: VL understanding of symbolic/abstract categories degrades after naive SFT.
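To make the probing concrete, here is a minimal, illustrative sketch of how one might quantify the first two symptoms: attention entropy as a proxy for diffuse, sink-like attention, and attention mass on the target object as a proxy for grounding. The `attention_diagnostics` helper, tensor shapes, and toy masks are hypothetical; this is not the paper's exact metric.

```python
# Illustrative diagnostic (not the paper's exact metric): given attention weights over
# image patches and a binary target-object mask, measure how diffuse the attention is
# (entropy) and how much attention mass falls on the object.
import torch

def attention_diagnostics(attn_patches: torch.Tensor, object_mask: torch.Tensor):
    """attn_patches: (num_patches,) attention weights over image patches (sums to 1).
    object_mask:   (num_patches,) binary mask marking patches of the target object."""
    p = attn_patches.clamp_min(1e-12)
    entropy = -(p * p.log()).sum()                   # high entropy => diffuse, sink-like attention
    on_object = (attn_patches * object_mask).sum()   # low value => weak object grounding
    return entropy.item(), on_object.item()

# Toy usage: 196 patches (14x14), uniform attention vs. attention peaked on the object.
mask = torch.zeros(196); mask[90:100] = 1.0
uniform = torch.full((196,), 1 / 196)
peaked = torch.zeros(196); peaked[90:100] = 0.1
print(attention_diagnostics(uniform, mask))  # high entropy, low object mass
print(attention_diagnostics(peaked, mask))   # low entropy, high object mass
```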
✨ Following the Platonic Representation Hypothesis, we introduce Visual Representation Alignment. During SFT, we pull the VLA's visual tokens toward a frozen teacher's features (via cosine similarity) through a lightweight frozen projector, keeping perception anchored while the model learns to act. Think of it like learning to drive with a compass: the policy learns control, while the compass (teacher-aligned vision) prevents drift into shortcut-driven, brittle perception.
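Below is a minimal sketch of such an alignment objective, assuming access to the VLA's hidden states at image-patch positions and to patch features from a frozen vision teacher. The class name `VisualAlignmentLoss`, the tensor shapes, and the plain linear projector are illustrative choices rather than the exact implementation; in particular, whether the projector is trained or kept frozen should follow the released code.

```python
# Sketch of a cosine-similarity alignment loss between the VLA's visual tokens
# (projected) and a frozen teacher's patch features. Illustrative only.
import torch
import torch.nn.functional as F

class VisualAlignmentLoss(torch.nn.Module):
    def __init__(self, vla_dim: int, teacher_dim: int):
        super().__init__()
        # Lightweight projector mapping VLA visual-token space into the teacher's space.
        self.proj = torch.nn.Linear(vla_dim, teacher_dim)

    def forward(self, vla_visual_tokens: torch.Tensor, teacher_feats: torch.Tensor):
        """vla_visual_tokens: (B, N, vla_dim) hidden states at image-patch positions.
        teacher_feats:       (B, N, teacher_dim) features from a frozen vision teacher."""
        pred = F.normalize(self.proj(vla_visual_tokens), dim=-1)
        target = F.normalize(teacher_feats.detach(), dim=-1)   # teacher stays frozen
        return 1.0 - (pred * target).sum(dim=-1).mean()        # 1 - mean cosine similarity

# During SFT, the total objective would combine the usual action loss with this term,
# e.g. loss = action_loss + lambda_align * align_loss(vla_tokens, teacher_tokens),
# where lambda_align is a weighting hyperparameter (name assumed for illustration).
```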
📊 To measure the transfer of VL understanding and knowledge from VLMs to VLAs independently of low-level control, we built VL-Think, a minimal pick-and-place suite that isolates VL understanding from control. It probes symbols, colors, arrows, traffic/weather icons, and so on, so performance drops reflect VL forgetting, not grasping skill. Visual Representation Alignment mitigates domain forgetting and boosts performance on the Color and Shape tasks, even surpassing the PrismaticVLM upper bound. Yet limited data diversity and LoRA capacity still hinder recovery of rarer VL concepts, a key direction for future work.
📈 We evaluate our approach in Simpler-based environments using VL-Think and the benchmark introduced by RL4VLA, which measures VLA generalization across Vision (textures, noise), Semantics (unseen objects, paraphrases, distractors), and Execution (randomized poses, object changes). Results: Align (ours) > naive SFT > vision-encoder freezing across OOD settings, and our alignment method yields consistent improvements across all evaluation axes. Linear probing on ImageNet-100 also shows stronger visual features.
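For reference, linear probing amounts to training a single linear classifier on frozen features and reporting accuracy. The sketch below uses random tensors in place of real ImageNet-100 features and a hypothetical `linear_probe` helper; it illustrates the protocol, not the exact evaluation pipeline.

```python
# Minimal linear-probing sketch: fit one linear layer on frozen features, report test accuracy.
import torch
import torch.nn.functional as F

def linear_probe(train_feats, train_labels, test_feats, test_labels,
                 num_classes: int, epochs: int = 10, lr: float = 1e-2) -> float:
    probe = torch.nn.Linear(train_feats.shape[-1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(probe(train_feats), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (probe(test_feats).argmax(-1) == test_labels).float().mean().item()

# Toy usage with random features standing in for frozen backbone outputs (768-dim, 100 classes).
tr_x, tr_y = torch.randn(512, 768), torch.randint(0, 100, (512,))
te_x, te_y = torch.randn(128, 768), torch.randint(0, 100, (128,))
print(linear_probe(tr_x, tr_y, te_x, te_y, num_classes=100))
```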
Related papers recommended by the Semantic Scholar API:
- Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models (2025)
- Contrastive Representation Regularization for Vision-Language-Action Models (2025)
- Igniting VLMs toward the Embodied Space (2025)
- VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation (2025)
- Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting (2025)