
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

This model is a LoRA adapter presented in the paper Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization.

To address the degradation of visual-language (VL) representations during VLA supervised fine-tuning (SFT), we introduce Visual Representation Alignment. During SFT, we pull a VLA’s visual tokens toward a frozen teacher’s patch features using cosine similarity through a lightweight frozen projector. This keeps perception anchored while the model learns to act — improving OOD generalization with almost no added cost.
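For concreteness, here is a minimal sketch of such a projector. The 4096-dim input matches OpenVLA's Llama-2-7B hidden size, while the teacher width (1024, e.g. a ViT-L teacher) and the two-layer MLP shape are illustrative assumptions, not the exact module used in the repo.

import torch.nn as nn

# Hypothetical dimensions: the VLA hidden size is 4096 (Llama-2-7B backbone);
# the teacher dimension depends on the chosen vision encoder (1024 is assumed here).
VLA_HIDDEN_DIM = 4096
TEACHER_DIM = 1024

# Lightweight projector mapping VLA visual-token states into the teacher's feature space.
alignment_projector = nn.Sequential(
    nn.Linear(VLA_HIDDEN_DIM, TEACHER_DIM),
    nn.SiLU(),
    nn.Linear(TEACHER_DIM, TEACHER_DIM),
)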


Model Details

Model Description

The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. This work systematically studies representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, the authors probe VLA's hidden representations and analyze attention maps, and design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. They introduce a simple yet effective method, Visual Representation Alignment, that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios.

The paper also introduces the VL-Think Task Suite, a diagnostic suite assessing the transfer of VL understanding and knowledge from VLMs to VLAs independently of low-level control. This suite focuses on whether models retain the ability to interpret visual symbols, compositional cues, and categorical relations rather than pure manipulation skills.

  • Developed by: Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
  • Model type: Vision-Language-Action (VLA) model (LoRA adapter)
  • Language(s): English
  • License: Apache-2.0
  • Finetuned from model: openvla/openvla-7b

Model Sources

  • Repository: https://github.com/CognitiveAISystems/BlindVLA
  • Paper: https://arxiv.org/abs/2510.25616

Uses

Direct Use

This model is intended for research in Vision-Language-Action (VLA) models, particularly for understanding and improving out-of-distribution (OOD) generalization in robotic and agent control tasks through visual representation alignment. Researchers can use this adapter and methodology to fine-tune base VLA models and explore the impact of representation degradation.

Out-of-Scope Use

As a research artifact, this model is not intended for deployment in real-world, safety-critical applications without further rigorous testing, validation, and adaptation. It is focused on studying and mitigating specific representation issues in VLAs, rather than serving as a production-ready agent.

How to Get Started with the Model

Installation

Use the environment setup commands below to get started:

# Create and activate conda environment
conda create -n blindvla python=3.10 -y
conda activate blindvla

# Install PyTorch. Below is a sample command to do this, but you should check the following link
# to find installation instructions that are specific to your compute platform:
# https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio

# Clone and install the BlindVLA repo
git clone https://github.com/CognitiveAISystems/BlindVLA.git
cd BlindVLA
pip install -e ./openvla

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
#   =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip3 install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
pip install diffusers==0.33.0

pip install -e ./ManiSkill
pip install -e ./SimplerEnv
pip install -U "typeguard>=3"

The pretrained OpenVLA model is warmed up for 2k steps on 140 episodes collected with Octo-Small and a motion planner. You can download the training dataset (1.4k episodes) here and the warm-up checkpoint here.
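As a convenience, the warm-up checkpoint used as --vla_path in the fine-tuning script below can also be fetched programmatically with huggingface_hub. This is a hedged sketch, not part of the repo's own instructions:

# Optional: download the merged warm-up checkpoint referenced by the fine-tuning script below.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="tttonyalpha/openvla-7b-warmup-checkpoint_merged_002000_lora_002000"
)
print(ckpt_dir)  # local path that can be passed as --vla_path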

Sample Usage (LoRA Fine-tuning with Visual Representation Alignment)

Below is a minimal example from the GitHub README of how you can integrate Visual Representation Alignment into your VLA’s training pipeline. Just plug in these few lines right after your forward pass — no architecture changes are needed.

import torch
import torch.nn.functional as F

# ....
# out = vla.forward(..., output_hidden_states=True)
# pixel_values = preprocessor(image, ...)
# ....

n_vis = out.projector_features.shape[1]
pos, pos_end = 1, 1 + n_vis  # visual tokens sit right after the BOS token

# 1. Extract VLA's visual features from a specific layer and project to the teacher's feature dimension
vla_features = out.hidden_states[align_layer][:, pos:pos_end]
vla_features = alignment_projector(vla_features)

# 2. Get teacher patch features
with torch.no_grad():
    teacher_features = teacher_vision_backbone(pixel_values)

# 3. Compute cosine alignment loss
emb_t = F.normalize(teacher_features, dim=-1)
emb_s = F.normalize(vla_features, dim=-1)

cossim = (emb_t * emb_s).sum(dim=-1)
align_loss = (-cossim).mean()

loss += cfg.align_coeff * align_loss

You can run LoRA fine-tuning with Visual Representation Alignment using this script:

openvla_path="tttonyalpha/openvla-7b-warmup-checkpoint_merged_002000_lora_002000"

torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vla_path "$openvla_path" \
  --data_root_dir "datasets" \
  --dataset_name "sft" \
  --run_root_dir "runs" \
  --lora_rank 32 \
  --batch_size 8 \
  --max_steps 60000 \
  --eval_steps 200 \
  --save_steps "0,5000,10000,20000,30000,40000,50000,60000" \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug True
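Once fine-tuned (or to try the released adapter directly), the base OpenVLA model and this LoRA adapter can be loaded with transformers and peft. The snippet below is a minimal sketch, not code from the repo; the image path, instruction, and unnorm_key are placeholders, and the correct unnorm_key must match the statistics of the fine-tuning dataset.

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

base_id = "openvla/openvla-7b"
adapter_id = "tttonyalpha/openvla-7b-warmup-checkpoint_lora_002000"

processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

# Attach the LoRA adapter and merge it so OpenVLA's custom predict_action stays available.
vla = PeftModel.from_pretrained(vla, adapter_id).merge_and_unload()

image = Image.open("frame.png")  # placeholder observation
prompt = "In: What action should the robot take to pick up the cube?\nOut:"
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

action = vla.predict_action(**inputs, unnorm_key="sft", do_sample=False)  # unnorm_key is an assumption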

Training Details

Training Data

The model is warmed up for 2k steps on 140 episodes collected with Octo-Small and a motion planner. A larger training dataset (1.4k episodes) is available here.

Training Procedure

Training Hyperparameters

The model is fine-tuned using LoRA with the following hyperparameters, as passed to the finetune.py script shown above (a rough peft.LoraConfig equivalent is sketched after the list):

  • LoRA rank: 32
  • Batch size: 8
  • Max steps: 60000
  • Evaluation steps: 200
  • Save steps: 0, 5000, 10000, 20000, 30000, 40000, 50000, 60000
  • Gradient accumulation steps: 1
  • Learning rate: 5e-4
  • Image augmentation: True
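For reference, these settings correspond roughly to the following peft configuration; lora_alpha, lora_dropout, and target_modules are assumptions here, since only the LoRA rank is listed above.

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                         # LoRA rank from the list above
    lora_alpha=16,                # assumed
    lora_dropout=0.0,             # assumed
    target_modules="all-linear",  # assumed: adapt every linear layer
    init_lora_weights="gaussian",
    task_type="CAUSAL_LM",
)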

Evaluation

Testing Data, Factors & Metrics

The model is evaluated using the VL-Think Task Suite, a diagnostic suite assessing the transfer of VL understanding and knowledge from VLMs to VLAs independently of low-level control. The suite includes various tasks, focusing on the ability to interpret visual symbols, compositional cues, and categorical relations. Examples of tasks include:

  • PutOnShapeInSceneMultiColor-v1 (13 shapes)
  • PutOnColorInSceneMulti-v1 (8 colors)
  • PutOnLaundryIconInSceneMulti-v1 (17 laundry icons)
  • PutOnNumberInSceneParity-v1 (8 numbers)
  • PutOnPublicInfoSignInSceneMulti-v1 (14 public info signs)
  • PutOnSignTrafficInSceneMulti-v1 (24 traffic signs)
  • PutOnWeatherIconInSceneMulti-v1 (9 weather icons)
  • PutOnArrowSignInSceneMulti-v1 (4 directions)

Evaluation is performed using batched environments for efficient parallel processing.
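A hedged sketch of spinning up one of these tasks as a batch of parallel environments is shown below; the registration import and the observation mode are assumptions based on the bundled ManiSkill/SimplerEnv forks and may need adjusting.

import gymnasium as gym
import mani_skill.envs  # assumed entry point that registers the environments

env = gym.make(
    "PutOnShapeInSceneMultiColor-v1",
    num_envs=8,       # batched environments for parallel evaluation
    obs_mode="rgb",   # assumed observation mode
)

obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # replace with VLA-predicted actions
    obs, reward, terminated, truncated, info = env.step(action)
env.close()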

Results

The paper demonstrates that Visual Representation Alignment mitigates degradation of visual representations and yields improved generalization to out-of-distribution (OOD) scenarios. For detailed results, refer to the paper.

Citation

If you find our code useful, please cite our paper:

@misc{kachaev2025dontblindvlaaligning,
      title={Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization},
      author={Nikita Kachaev and Mikhail Kolosov and Daniil Zelezetsky and Alexey K. Kovalev and Aleksandr I. Panov},
      year={2025},
      eprint={2510.25616},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.25616},
}

Acknowledgement

BlindVLA builds on RL4VLA, Simpler, REPA, and OpenVLA. Many thanks for their awesome work!
