Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Abstract
A systematic study reveals that naive action fine-tuning degrades visual representations in Vision-Language-Action models, but targeted alignment strategies can mitigate this degradation and improve generalization.
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
Community
arXiv: https://arxiv.org/abs/2510.25616
Project Page: https://blind-vla-paper.github.io/
Code: https://github.com/CognitiveAISystems/BlindVLA
Action fine-tuning often blinds VLA models: they lose the vision-language (VL) priors that made them smart. We show how to keep those priors intact with a tiny alignment loss. 👀🤖
🔍 We probed the VL representations of OpenVLA and observed three major problems: (1) attention sink: attention maps become diffuse, noisy, and weakly correlated with the target object; (2) representation collapse: patch embeddings lose separability; (3) domain forgetting: VL understanding of symbolic/abstract categories degrades after naive SFT.
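To make the probing concrete, here is a minimal, illustrative sketch of how one might quantify the first two symptoms: attention entropy as a proxy for diffuse, sink-like attention, and attention mass on the target object as a proxy for grounding. The `attention_diagnostics` helper, tensor shapes, and toy masks are hypothetical; this is not the paper's exact metric.

```python
# Illustrative diagnostic (not the paper's exact metric): given attention weights over
# image patches and a binary target-object mask, measure how diffuse the attention is
# (entropy) and how much attention mass falls on the object.
import torch

def attention_diagnostics(attn_patches: torch.Tensor, object_mask: torch.Tensor):
    """attn_patches: (num_patches,) attention weights over image patches (sums to 1).
    object_mask:   (num_patches,) binary mask marking patches of the target object."""
    p = attn_patches.clamp_min(1e-12)
    entropy = -(p * p.log()).sum()                   # high entropy => diffuse, sink-like attention
    on_object = (attn_patches * object_mask).sum()   # low value => weak object grounding
    return entropy.item(), on_object.item()

# Toy usage: 196 patches (14x14), uniform attention vs. attention peaked on the object.
mask = torch.zeros(196); mask[90:100] = 1.0
uniform = torch.full((196,), 1 / 196)
peaked = torch.zeros(196); peaked[90:100] = 0.1
print(attention_diagnostics(uniform, mask))  # high entropy, low object mass
print(attention_diagnostics(peaked, mask))   # low entropy, high object mass
```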
✨ Following the Platonic Representation Hypothesis, we introduce Visual Representation Alignment. During SFT, we pull the VLA's visual tokens toward a frozen teacher's features (via cosine similarity) through a lightweight frozen projector, keeping perception anchored while the model learns to act. Think of it like learning to drive with a compass: the policy learns control, while the compass (teacher-aligned vision) prevents drift into shortcut-driven, brittle perception.
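Below is a minimal sketch of such an alignment objective, assuming access to the VLA's hidden states at image-patch positions and to patch features from a frozen vision teacher. The class name `VisualAlignmentLoss`, the tensor shapes, and the plain linear projector are illustrative choices rather than the exact implementation; in particular, whether the projector is trained or kept frozen should follow the released code.

```python
# Sketch of a cosine-similarity alignment loss between the VLA's visual tokens
# (projected) and a frozen teacher's patch features. Illustrative only.
import torch
import torch.nn.functional as F

class VisualAlignmentLoss(torch.nn.Module):
    def __init__(self, vla_dim: int, teacher_dim: int):
        super().__init__()
        # Lightweight projector mapping VLA visual-token space into the teacher's space.
        self.proj = torch.nn.Linear(vla_dim, teacher_dim)

    def forward(self, vla_visual_tokens: torch.Tensor, teacher_feats: torch.Tensor):
        """vla_visual_tokens: (B, N, vla_dim) hidden states at image-patch positions.
        teacher_feats:       (B, N, teacher_dim) features from a frozen vision teacher."""
        pred = F.normalize(self.proj(vla_visual_tokens), dim=-1)
        target = F.normalize(teacher_feats.detach(), dim=-1)   # teacher stays frozen
        return 1.0 - (pred * target).sum(dim=-1).mean()        # 1 - mean cosine similarity

# During SFT, the total objective would combine the usual action loss with this term,
# e.g. loss = action_loss + lambda_align * align_loss(vla_tokens, teacher_tokens),
# where lambda_align is a weighting hyperparameter (name assumed for illustration).
```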
📊 To measure the transfer of VL understanding and knowledge from VLMs to VLAs independently of low-level control, we built VL-Think, a minimal pick-and-place suite that isolates VL understanding from control. It probes symbols, colors, arrows, traffic/weather icons, and so on, so performance drops reflect VL forgetting, not grasping skill. Visual Representation Alignment mitigates domain forgetting and boosts performance on the Color and Shape tasks, even surpassing the PrismaticVLM upper bound. Yet limited data diversity and LoRA capacity still hinder recovery of rarer VL concepts, a key direction for future work.
📈 We evaluate our approach in Simpler-based environments using VL-Think and the benchmark introduced by RL4VLA, which measures VLA generalization across Vision (textures, noise), Semantics (unseen objects, paraphrases, distractors), and Execution (randomized poses, object changes). Results: Align (ours) > naive SFT > vision-encoder freezing across OOD settings, and our alignment method yields consistent improvements across all evaluation axes. Linear probing on ImageNet-100 also shows stronger visual features.
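For reference, linear probing amounts to training a single linear classifier on frozen features and reporting accuracy. The sketch below uses random tensors in place of real ImageNet-100 features and a hypothetical `linear_probe` helper; it illustrates the protocol, not the exact evaluation pipeline.

```python
# Minimal linear-probing sketch: fit one linear layer on frozen features, report test accuracy.
import torch
import torch.nn.functional as F

def linear_probe(train_feats, train_labels, test_feats, test_labels,
                 num_classes: int, epochs: int = 10, lr: float = 1e-2) -> float:
    probe = torch.nn.Linear(train_feats.shape[-1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(probe(train_feats), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (probe(test_feats).argmax(-1) == test_labels).float().mean().item()

# Toy usage with random features standing in for frozen backbone outputs (768-dim, 100 classes).
tr_x, tr_y = torch.randn(512, 768), torch.randint(0, 100, (512,))
te_x, te_y = torch.randn(128, 768), torch.randint(0, 100, (128,))
print(linear_probe(tr_x, tr_y, te_x, te_y, num_classes=100))
```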
Related papers recommended by the Semantic Scholar API:
- Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models (2025)
- Contrastive Representation Regularization for Vision-Language-Action Models (2025)
- Igniting VLMs toward the Embodied Space (2025)
- VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation (2025)
- Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting (2025)