---
license: cc-by-sa-4.0
library_name: llava
datasets:
- lmms-lab/LLaVA-Video-178K
language:
- en
metrics:
- accuracy
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
pipeline_tag: video-text-to-text
tags:
- token-reduction
- video-understanding
- video-llm
- video-qa
- extreme-token-reduction
- llava
- onevision
- vqtoken
- discrete-tokens
- compression
- lmms-eval
- efficiency
---
# VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)
**VQToken** is a neural **discrete token** representation for video that enables **extreme token reduction** (~**0.07%** of dense tokens) while retaining strong downstream performance.
This repository hosts the **0.5B** VQToken-enabled **LLaVA-OneVision** checkpoint.
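If you want to fetch the weights ahead of time (for example for offline runs), a minimal sketch using `huggingface_hub` is shown below. This step is optional: the loader used later in this card also accepts the repo id directly.

```python
# Optional: pre-download the checkpoint into the local HF cache (useful for
# offline or air-gapped runs). Not required — load_pretrained_model below can
# take the repo id directly.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="haichaozhang/VQ-Token-llava-ov-0.5b")
print(local_path)  # directory containing the downloaded weights and configs
```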
---
## Model Summary
- **Base backbone:** LLaVA-OneVision (0.5B)
- **VQToken module:** learns discrete video tokens; supports **fixed** / **adaptive** token budgets
- **Goal:** reduce the video token count dramatically while preserving video-LLM accuracy (see the rough arithmetic below)
- **Interface:** works with **[lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)** (preferred) and with the modified LLaVA-OneVision loader in the project repo
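To give a rough sense of what "~0.07%" means in practice, the sketch below computes the reduction ratio for an illustrative configuration. The per-frame token count, frame count, and token budget here are assumptions for illustration, not the paper's exact settings.

```python
# Back-of-envelope for the "~0.07%" figure (illustrative numbers only).
frames = 32
dense_tokens_per_frame = 729          # e.g. a 27x27 ViT patch grid (assumption)
dense_total = frames * dense_tokens_per_frame

vq_token_budget = 16                  # hypothetical fixed VQToken budget
ratio = vq_token_budget / dense_total
print(f"dense tokens: {dense_total}, VQToken budget: {vq_token_budget}, "
      f"ratio: {ratio:.4%}")          # ~0.07% with these illustrative numbers
```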
---
## How this checkpoint was trained
- **Finetune script:** [`finetune_ov_all.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh)
- **Dataset:** [`lmms-lab/LLaVA-Video-178K`](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K)
The VQToken adapter is integrated with OneVision-0.5B and finetuned on the above dataset. See the training script for full hyperparameters and pipeline details.
---
## Quick Test (CLI via lmms-eval)
We recommend testing with **lmms-eval**. The repo provides a ready-made script:
- **Script:** [`test_vqtoken_0.5b.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh)
Or run the equivalent command directly:
```bash
# env (adjust as needed)
export HF_HOME="/path/to/your/hf/cache"
export HF_TOKEN="your_hf_token_here"
export HF_HUB_ENABLE_HF_TRANSFER=1
# if any eval calls OpenAI endpoints
# export OPENAI_API_KEY="your_openai_key_here"
# Helpful on some single-GPU setups
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"
PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \
-m lmms_eval \
--model llava_onevision_vqtoken \
--model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \
--tasks activitynetqa --batch_size 1 \
--log_samples \
--log_samples_suffix llava_onevision \
--output_path ./logs_vqtoken/
```
> You can swap `--tasks` for other video QA benchmarks supported by **lmms-eval**.
---
## Minimal Python Inference
```python
import copy
import numpy as np
import torch
from decord import VideoReader, cpu
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b"
tok, model, imgproc, _ = load_pretrained_model(
    pretrained, None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa", multimodal=True
)
model.eval()

def frames(path, n=16):
    """Uniformly sample n frames from the video as a (T, H, W, C) uint8 array."""
    vr = VideoReader(path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, n, dtype=int).tolist()
    return vr.get_batch(idx).asnumpy()

video = "sample/demo.mp4"
vid = frames(video, 16)
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()
images = [pix]

# Build the qwen_1_5 chat prompt with the image placeholder token.
conv = copy.deepcopy(conv_templates["qwen_1_5"])
q = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv.append_message(conv.roles[0], q)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
sizes = [f.shape[:2] for f in vid]  # (H, W) of each sampled frame

with torch.no_grad():
    out = model.generate(
        ids, images=images, image_sizes=sizes,
        do_sample=False, temperature=0, max_new_tokens=512,
        modalities=["video"], vis=True
    )
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```
---
## Intended Use & Notes
- **Use cases:** video question answering and video captioning/understanding in scenarios where the token budget is tight.
- **Strengths:** **extreme token reduction** (~0.07% of dense tokens) with competitive performance; supports both **fixed** and **adaptive** token-budget regimes.
- **Out-of-scope / caveats:** model may hallucinate or be brittle on out-of-distribution content; always validate on your task.
---
## Evaluation
We evaluate with **lmms-eval** for consistent, reproducible benchmarking. See the repository logs and the paper for details on datasets, metrics, and token budgets (fixed vs. adaptive).
---
## Resources
- **Paper (arXiv):** https://arxiv.org/pdf/2503.16980
- **Project Page:** https://www.zhanghaichao.xyz/VQToken/
- **Code:** https://github.com/Hai-chao-Zhang/VQToken
- **Dataset:** https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- **Test Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh
- **Train Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh
---
## Citation
```bibtex
@inproceedings{zhang2025vqtoken,
title = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models},
author = {Haichao Zhang and Yun Fu},
booktitle = {NeurIPS},
year = {2025}
}
```
---
## Acknowledgements
Thanks to **LLaVA-OneVision / LLaVA-NeXT** and **lmms-eval** communities for open tooling and baselines.