|
|
--- |
|
|
license: cc-by-sa-4.0 |
|
|
library_name: llava |
|
|
datasets: |
|
|
- lmms-lab/LLaVA-Video-178K |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- accuracy |
|
|
base_model: |
|
|
- lmms-lab/llava-onevision-qwen2-0.5b-ov |
|
|
pipeline_tag: video-text-to-text |
|
|
tags: |
|
|
- token-reduction |
|
|
- video-understanding |
|
|
- video-llm |
|
|
- video-qa |
|
|
- extreme-token-reduction |
|
|
- llava |
|
|
- onevision |
|
|
- vqtoken |
|
|
- discrete-tokens |
|
|
- compression |
|
|
- lmms-eval |
|
|
- efficiency |
|
|
--- |
|
|
|
|
|
|
|
|
# VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/pdf/2503.16980"> |
|
|
<img src="https://img.shields.io/badge/ArXiv-2503.16980-red?style=for-the-badge&logo=arxiv" alt="ArXiv"/> |
|
|
</a> |
|
|
<a href="https://www.zhanghaichao.xyz/VQToken/"> |
|
|
<img src="https://img.shields.io/badge/Project-Website-blue?style=for-the-badge&logo=google-chrome" alt="Website"/> |
|
|
</a> |
|
|
<a href="https://github.com/Hai-chao-Zhang/VQToken"> |
|
|
<img src="https://img.shields.io/badge/Code-GitHub-black?style=for-the-badge&logo=github" alt="GitHub"/> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/66393f5a1231260674ae798e/uxTxaVBWdFrGRHIyeUJbp.jpeg" alt="VQToken Teaser" width="100%"> |
|
|
</p> |
|
|
|
|
|
**VQToken** is a neural **discrete token** representation for video that enables **extreme token reduction** (~**0.07%** of dense tokens) while retaining strong downstream performance. |
|
|
This repository hosts the **0.5B** VQToken-enabled **LLaVA-OneVision** checkpoint. |
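
To make the ratio concrete, here is a rough back-of-the-envelope sketch; the frame count and per-frame token count below are illustrative assumptions, not the exact configuration from the paper.

```bash
# Illustrative arithmetic only: frame and per-frame token counts are assumed values.
frames=32
tokens_per_frame=729                      # e.g., a ViT-style patch grid per frame
dense=$((frames * tokens_per_frame))      # 23,328 dense visual tokens
reduced=$((dense * 7 / 10000))            # ~0.07% of dense -> roughly 16 tokens
echo "dense=${dense} reduced~=${reduced}"
```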
|
|
|
|
|
--- |
|
|
|
|
|
## Model Summary
|
|
|
|
|
- **Base backbone:** LLaVA-OneVision (0.5B) |
|
|
- **VQToken module:** learns discrete video tokens; supports **fixed** / **adaptive** token budgets |
|
|
- **Goal:** reduce the video token count dramatically while preserving video-LLM accuracy
|
|
- **Interface:** works with **[lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)** (preferred) and with the modified LLaVA-OneVision loader in the project repo (setup sketch below)
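
This checkpoint is meant to be loaded through the modified LLaVA-OneVision code in the project repo. A minimal setup sketch follows; the exact install commands (package layout, extras) are assumptions on my part, so defer to the GitHub README if they differ.

```bash
# Setup sketch only: exact steps may differ; see the VQToken GitHub README.
git clone https://github.com/Hai-chao-Zhang/VQToken
cd VQToken
pip install -e .        # assumed: editable install of the modified LLaVA-OneVision code
pip install lmms-eval   # evaluation harness used throughout this card
```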
|
|
|
|
|
--- |
|
|
|
|
|
## How this checkpoint was trained
|
|
|
|
|
- **Finetune script:** [`finetune_ov_all.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh) |
|
|
- **Dataset:** [`lmms-lab/LLaVA-Video-178K`](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) |
|
|
|
|
|
The VQToken adapter is integrated into LLaVA-OneVision-0.5B and finetuned on the dataset above. See the training script for the full hyperparameters and pipeline details.
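
As a rough outline of reproducing the run (dataset paths, GPU counts, and hyperparameters live inside the script, so treat this as a sketch rather than the authoritative recipe):

```bash
# From the VQToken repo root (see the setup sketch above).
# Outline only: edit dataset/checkpoint paths and GPU settings inside the script first.
bash finetune_ov_all.sh   # finetunes the VQToken adapter on LLaVA-Video-178K
```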
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Test (CLI via lmms-eval)
|
|
|
|
|
We recommend testing with **lmms-eval**. The repo provides a ready-made script: |
|
|
|
|
|
- **Script:** [`test_vqtoken_0.5b.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh) |
|
|
|
|
|
Or run the equivalent command directly: |
|
|
|
|
|
```bash |
|
|
# env (adjust as needed) |
|
|
export HF_HOME="/path/to/your/hf/cache" |
|
|
export HF_TOKEN="your_hf_token_here" |
|
|
export HF_HUB_ENABLE_HF_TRANSFER=1 |
|
|
# if any eval calls OpenAI endpoints |
|
|
# export OPENAI_API_KEY="your_openai_key_here" |
|
|
|
|
|
# Helpful on some single-GPU setups |
|
|
export NCCL_P2P_DISABLE="1" |
|
|
export NCCL_IB_DISABLE="1" |
|
|
|
|
|
PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b |
|
|
|
|
|
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \ |
|
|
-m lmms_eval \ |
|
|
--model llava_onevision_vqtoken \ |
|
|
--model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \ |
|
|
--tasks activitynetqa --batch_size 1 \ |
|
|
--log_samples \ |
|
|
--log_samples_suffix llava_onevision \ |
|
|
--output_path ./logs_vqtoken/ |
|
|
``` |
|
|
|
|
|
> You can swap `--tasks` for other video QA benchmarks supported by **lmms-eval**. |
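
For example (the task name below is an assumption; availability depends on your lmms-eval version, so list the tasks first to confirm):

```bash
# List the benchmarks your installed lmms-eval actually provides
python -m lmms_eval --tasks list

# Then swap the task in the command above, e.g.
#   --tasks videomme
```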
|
|
|
|
|
--- |
|
|
|
|
|
## Minimal Python Inference
|
|
|
|
|
```python |
|
|
import copy, numpy as np, torch |
|
|
from decord import VideoReader, cpu |
|
|
from llava.model.builder import load_pretrained_model |
|
|
from llava.mm_utils import tokenizer_image_token |
|
|
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN |
|
|
from llava.conversation import conv_templates |
|
|
|
|
|
pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b" |
|
|
tok, model, imgproc, _ = load_pretrained_model( |
|
|
pretrained, None, "llava_qwen", |
|
|
device_map="auto", attn_implementation="sdpa", multimodal=True |
|
|
) |
|
|
model.eval() |
|
|
|
|
|
def frames(path, n=16): |
|
|
vr = VideoReader(path, ctx=cpu(0)) |
|
|
idx = np.linspace(0, len(vr)-1, n, dtype=int).tolist() |
|
|
return vr.get_batch(idx).asnumpy() # (T,H,W,C) |
|
|
|
|
|
video = "sample/demo.mp4" |
|
|
vid = frames(video, 16) |
|
|
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda() |
|
|
images = [pix] |
|
|
|
|
|
conv = copy.deepcopy(conv_templates["qwen_1_5"]) |
|
|
q = f"{DEFAULT_IMAGE_TOKEN}\\nDescribe what's happening in this video." |
|
|
conv.append_message(conv.roles[0], q); conv.append_message(conv.roles[1], None) |
|
|
prompt = conv.get_prompt() |
|
|
|
|
|
ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() |
|
|
sizes = [f.shape[:2] for f in vid] |
|
|
|
|
|
with torch.no_grad(): |
|
|
out = model.generate( |
|
|
ids, images=images, image_sizes=sizes, |
|
|
do_sample=False, temperature=0, max_new_tokens=512, |
|
|
modalities=["video"], vis=True |
|
|
) |
|
|
|
|
|
print(tok.batch_decode(out, skip_special_tokens=True)[0]) |
|
|
``` |
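
The snippet assumes the VQToken fork of LLaVA-OneVision is installed (see the setup sketch above) plus a frame reader; a likely extra dependency, with the version left unpinned as an assumption:

```bash
pip install decord   # frame sampling used by the `frames` helper above
```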
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use & Notes
|
|
|
|
|
- **Use cases:** video question answering and video captioning/understanding in scenarios where the token budget is tight.
|
|
- **Strengths:** **extreme token reduction** (~0.07% of dense tokens) with competitive performance; supports both fixed and adaptive token budgets.
|
|
- **Out-of-scope / caveats:** the model may hallucinate or be brittle on out-of-distribution content; always validate on your own task.
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation
|
|
|
|
|
We evaluate with **lmms-eval** for consistent, reproducible benchmarking. See the repo logs and the paper for details on the datasets, metrics, and token budgets (fixed vs. adaptive).
|
|
|
|
|
--- |
|
|
|
|
|
## Resources
|
|
|
|
|
- **Paper (arXiv):** https://arxiv.org/pdf/2503.16980 |
|
|
- **Project Page:** https://www.zhanghaichao.xyz/VQToken/ |
|
|
- **Code:** https://github.com/Hai-chao-Zhang/VQToken |
|
|
- **Dataset:** https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K |
|
|
- **Test Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh |
|
|
- **Train Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation
|
|
|
|
|
```bibtex |
|
|
@inproceedings{zhang2025vqtoken, |
|
|
title = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models}, |
|
|
author = {Haichao Zhang and Yun Fu}, |
|
|
booktitle = {NeurIPS}, |
|
|
year = {2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements
|
|
|
|
|
Thanks to the **LLaVA-OneVision / LLaVA-NeXT** and **lmms-eval** communities for their open tooling and baselines.
|
|
|