---
license: cc-by-sa-4.0
library_name: llava
datasets:
- lmms-lab/LLaVA-Video-178K
language:
- en
metrics:
- accuracy
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
pipeline_tag: video-text-to-text
tags:
- token-reduction
- video-understanding
- video-llm
- video-qa
- extreme-token-reduction
- llava
- onevision
- vqtoken
- discrete-tokens
- compression
- lmms-eval
- efficiency
---

# VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)

[arXiv](https://arxiv.org/pdf/2503.16980) · [Website](https://www.zhanghaichao.xyz/VQToken/) · [GitHub](https://github.com/Hai-chao-Zhang/VQToken)

*(Figure: VQToken teaser)*

**VQToken** is a neural **discrete token** representation for video that enables **extreme token reduction** (~**0.07%** of dense tokens) while retaining strong downstream performance. This repository hosts the **0.5B** VQToken-enabled **LLaVA-OneVision** checkpoint.

---

## 🧠 Model Summary

- **Base backbone:** LLaVA-OneVision (0.5B)
- **VQToken module:** learns discrete video tokens; supports **fixed** / **adaptive** token budgets
- **Goal:** drastically reduce the video token count while preserving video-LLM accuracy
- **Interface:** works with **[lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)** (preferred) and the modified LLaVA-OneVision loader in the project repo

---

## 🏗️ How this checkpoint was trained

- **Finetune script:** [`finetune_ov_all.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh)
- **Dataset:** [`lmms-lab/LLaVA-Video-178K`](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K)

The VQToken adapter is integrated with OneVision-0.5B and finetuned on the dataset above. See the training script for the full hyperparameters and pipeline details.

---

## 🚀 Quick Test (CLI via lmms-eval)

We recommend testing with **lmms-eval**. The repo provides a ready-made script:

- **Script:** [`test_vqtoken_0.5b.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh)

Or run the equivalent command directly:

```bash
# env (adjust as needed)
export HF_HOME="/path/to/your/hf/cache"
export HF_TOKEN="your_hf_token_here"
export HF_HUB_ENABLE_HF_TRANSFER=1

# if any eval calls OpenAI endpoints
# export OPENAI_API_KEY="your_openai_key_here"

# Helpful on some single-GPU setups
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \
  -m lmms_eval \
  --model llava_onevision_vqtoken \
  --model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks activitynetqa --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs_vqtoken/
```

> You can swap `--tasks` for other video QA benchmarks supported by **lmms-eval**.

---

## 🧪 Minimal Python Inference

```python
import copy

import numpy as np
import torch
from decord import VideoReader, cpu

from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b"
tok, model, imgproc, _ = load_pretrained_model(
    pretrained, None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa", multimodal=True
)
model.eval()

def frames(path, n=16):
    """Uniformly sample n frames from a video as a (T, H, W, C) uint8 array."""
    vr = VideoReader(path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, n, dtype=int).tolist()
    return vr.get_batch(idx).asnumpy()  # (T, H, W, C)

video = "sample/demo.mp4"
vid = frames(video, 16)
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()
images = [pix]

# Build the qwen_1_5 chat prompt with the image placeholder token.
conv = copy.deepcopy(conv_templates["qwen_1_5"])
q = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv.append_message(conv.roles[0], q)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
sizes = [f.shape[:2] for f in vid]

with torch.no_grad():
    out = model.generate(
        ids,
        images=images,
        image_sizes=sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=512,
        modalities=["video"],
        vis=True,
    )

print(tok.batch_decode(out, skip_special_tokens=True)[0])
```
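To put the ~0.07% figure in perspective, the sketch below works through the token-budget arithmetic. It is purely illustrative and rests on assumptions not stated in this card: 32 sampled frames and 729 patch tokens per frame (typical of the SigLIP vision tower used by LLaVA-OneVision); the exact dense count and reduced budget depend on the fixed/adaptive regime described in the paper.

```python
# Back-of-the-envelope token-budget arithmetic for the ~0.07% claim.
# Assumptions (not from this model card): 32 sampled frames and
# 729 patch tokens per frame, typical for LLaVA-OneVision's SigLIP tower.
frames_sampled = 32
tokens_per_frame = 729

dense_tokens = frames_sampled * tokens_per_frame        # ~23,328 dense video tokens
reduction_ratio = 0.0007                                # ~0.07% of the dense tokens
reduced_budget = round(dense_tokens * reduction_ratio)  # ~16 discrete VQ tokens

print(f"dense: {dense_tokens}, reduced: ~{reduced_budget} "
      f"({100 * reduced_budget / dense_tokens:.3f}% of dense)")
```

In other words, a whole clip is summarized by a handful of discrete tokens rather than tens of thousands of patch embeddings; see the paper for the exact budgets used in the fixed and adaptive settings.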
---

## 📦 Intended Use & Notes

- **Use cases:** video question answering and video captioning/understanding scenarios where the token budget is tight.
- **Strengths:** **extreme token reduction** (~0.07%) with competitive performance; supports both fixed and adaptive token-budget regimes.
- **Out-of-scope / caveats:** the model may hallucinate or be brittle on out-of-distribution content; always validate on your own task.

---

## 📊 Evaluation

We evaluate through **lmms-eval** for consistent, reproducible benchmarking. See the repo logs and the paper for details on datasets, metrics, and token budgets (fixed vs. adaptive).

---

## 🔗 Resources

- **Paper (arXiv):** https://arxiv.org/pdf/2503.16980
- **Project Page:** https://www.zhanghaichao.xyz/VQToken/
- **Code:** https://github.com/Hai-chao-Zhang/VQToken
- **Dataset:** https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- **Test Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh
- **Train Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh

---

## 📚 Citation

```bibtex
@inproceedings{zhang2025vqtoken,
  title     = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models},
  author    = {Haichao Zhang and Yun Fu},
  booktitle = {NeurIPS},
  year      = {2025}
}
```

---

## 🙏 Acknowledgements

Thanks to the **LLaVA-OneVision / LLaVA-NeXT** and **lmms-eval** communities for open tooling and baselines.