|
|
--- |
|
|
license: cc-by-sa-4.0 |
|
|
library_name: llava |
|
|
datasets: |
|
|
- lmms-lab/LLaVA-Video-178K |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- accuracy |
|
|
base_model: |
|
|
- lmms-lab/llava-onevision-qwen2-0.5b-ov |
|
|
pipeline_tag: video-text-to-text |
|
|
tags: |
|
|
- token-reduction |
|
|
- video-understanding |
|
|
- video-llm |
|
|
- video-qa |
|
|
- extreme-token-reduction |
|
|
- llava |
|
|
- onevision |
|
|
- vqtoken |
|
|
- discrete-tokens |
|
|
- compression |
|
|
- lmms-eval |
|
|
- efficiency |
|
|
--- |
|
|
|
|
|
|
|
|
# VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/pdf/2503.16980"> |
|
|
<img src="https://img.shields.io/badge/ArXiv-2503.16980-red?style=for-the-badge&logo=arxiv" alt="ArXiv"/> |
|
|
</a> |
|
|
<a href="https://www.zhanghaichao.xyz/VQToken/"> |
|
|
<img src="https://img.shields.io/badge/Project-Website-blue?style=for-the-badge&logo=google-chrome" alt="Website"/> |
|
|
</a> |
|
|
<a href="https://github.com/Hai-chao-Zhang/VQToken"> |
|
|
<img src="https://img.shields.io/badge/Code-GitHub-black?style=for-the-badge&logo=github" alt="GitHub"/> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/66393f5a1231260674ae798e/uxTxaVBWdFrGRHIyeUJbp.jpeg" alt="VQToken Teaser" width="100%"> |
|
|
</p> |
|
|
|
|
|
**VQToken** is a neural **discrete token** representation for video that enables **extreme token reduction** (~**0.07%** of dense tokens) while retaining strong downstream performance. |
|
|
This repository hosts the **0.5B** VQToken-enabled **LLaVA-OneVision** checkpoint. |
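
To make the ratio concrete, here is a rough back-of-the-envelope sketch; the frame count and per-frame token count below are illustrative assumptions, not the exact configuration from the paper.

```bash
# Illustrative arithmetic only: frame and per-frame token counts are assumed values.
frames=32
tokens_per_frame=729                      # e.g., a ViT-style patch grid per frame
dense=$((frames * tokens_per_frame))      # 23,328 dense visual tokens
reduced=$((dense * 7 / 10000))            # ~0.07% of dense -> roughly 16 tokens
echo "dense=${dense} reduced~=${reduced}"
```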
|
|
|
|
|
--- |
|
|
|
|
|
## Model Summary
|
|
|
|
|
- **Base backbone:** LLaVA-OneVision (0.5B) |
|
|
- **VQToken module:** learns discrete video tokens; supports **fixed** / **adaptive** token budgets |
|
|
- **Goal:** reduce the video token count dramatically while preserving video-LLM accuracy
|
|
- **Interface:** works with **[lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)** (preferred) and with the modified LLaVA-OneVision loader in the project repo (setup sketch below)
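
This checkpoint is meant to be loaded through the modified LLaVA-OneVision code in the project repo. A minimal setup sketch follows; the exact install commands (package layout, extras) are assumptions on my part, so defer to the GitHub README if they differ.

```bash
# Setup sketch only: exact steps may differ; see the VQToken GitHub README.
git clone https://github.com/Hai-chao-Zhang/VQToken
cd VQToken
pip install -e .        # assumed: editable install of the modified LLaVA-OneVision code
pip install lmms-eval   # evaluation harness used throughout this card
```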
|
|
|
|
|
--- |
|
|
|
|
|
## How this checkpoint was trained
|
|
|
|
|
- **Finetune script:** [`finetune_ov_all.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh) |
|
|
- **Dataset:** [`lmms-lab/LLaVA-Video-178K`](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) |
|
|
|
|
|
The VQToken adapter is integrated into LLaVA-OneVision-0.5B and finetuned on the dataset above. See the training script for the full hyperparameters and pipeline details.
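
As a rough outline of reproducing the run (dataset paths, GPU counts, and hyperparameters live inside the script, so treat this as a sketch rather than the authoritative recipe):

```bash
# From the VQToken repo root (see the setup sketch above).
# Outline only: edit dataset/checkpoint paths and GPU settings inside the script first.
bash finetune_ov_all.sh   # finetunes the VQToken adapter on LLaVA-Video-178K
```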
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Test (CLI via lmms-eval)
|
|
|
|
|
We recommend testing with **lmms-eval**. The repo provides a ready-made script: |
|
|
|
|
|
- **Script:** [`test_vqtoken_0.5b.sh`](https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh) |
|
|
|
|
|
Or run the equivalent command directly: |
|
|
|
|
|
```bash |
|
|
# env (adjust as needed) |
|
|
export HF_HOME="/path/to/your/hf/cache" |
|
|
export HF_TOKEN="your_hf_token_here" |
|
|
export HF_HUB_ENABLE_HF_TRANSFER=1 |
|
|
# if any eval calls OpenAI endpoints |
|
|
# export OPENAI_API_KEY="your_openai_key_here" |
|
|
|
|
|
# Helpful on some single-GPU setups |
|
|
export NCCL_P2P_DISABLE="1" |
|
|
export NCCL_IB_DISABLE="1" |
|
|
|
|
|
PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b |
|
|
|
|
|
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \ |
|
|
-m lmms_eval \ |
|
|
--model llava_onevision_vqtoken \ |
|
|
--model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \ |
|
|
--tasks activitynetqa --batch_size 1 \ |
|
|
--log_samples \ |
|
|
--log_samples_suffix llava_onevision \ |
|
|
--output_path ./logs_vqtoken/ |
|
|
``` |
|
|
|
|
|
> You can swap `--tasks` for other video QA benchmarks supported by **lmms-eval**. |
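
For example (the task name below is an assumption; availability depends on your lmms-eval version, so list the tasks first to confirm):

```bash
# List the benchmarks your installed lmms-eval actually provides
python -m lmms_eval --tasks list

# Then swap the task in the command above, e.g.
#   --tasks videomme
```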
|
|
|
|
|
--- |
|
|
|
|
|
## Minimal Python Inference
|
|
|
|
|
```python |
|
|
import copy, numpy as np, torch |
|
|
from decord import VideoReader, cpu |
|
|
from llava.model.builder import load_pretrained_model |
|
|
from llava.mm_utils import tokenizer_image_token |
|
|
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN |
|
|
from llava.conversation import conv_templates |
|
|
|
|
|
pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b" |
|
|
tok, model, imgproc, _ = load_pretrained_model( |
|
|
pretrained, None, "llava_qwen", |
|
|
device_map="auto", attn_implementation="sdpa", multimodal=True |
|
|
) |
|
|
model.eval() |
|
|
|
|
|
def frames(path, n=16): |
|
|
vr = VideoReader(path, ctx=cpu(0)) |
|
|
idx = np.linspace(0, len(vr)-1, n, dtype=int).tolist() |
|
|
return vr.get_batch(idx).asnumpy() # (T,H,W,C) |
|
|
|
|
|
video = "sample/demo.mp4" |
|
|
vid = frames(video, 16) |
|
|
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda() |
|
|
images = [pix] |
|
|
|
|
|
conv = copy.deepcopy(conv_templates["qwen_1_5"]) |
|
|
q = f"{DEFAULT_IMAGE_TOKEN}\\nDescribe what's happening in this video." |
|
|
conv.append_message(conv.roles[0], q); conv.append_message(conv.roles[1], None) |
|
|
prompt = conv.get_prompt() |
|
|
|
|
|
ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda() |
|
|
sizes = [f.shape[:2] for f in vid] |
|
|
|
|
|
with torch.no_grad(): |
|
|
out = model.generate( |
|
|
ids, images=images, image_sizes=sizes, |
|
|
do_sample=False, temperature=0, max_new_tokens=512, |
|
|
modalities=["video"], vis=True |
|
|
) |
|
|
|
|
|
print(tok.batch_decode(out, skip_special_tokens=True)[0]) |
|
|
``` |
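
The snippet assumes the VQToken fork of LLaVA-OneVision is installed (see the setup sketch above) plus a frame reader; a likely extra dependency, with the version left unpinned as an assumption:

```bash
pip install decord   # frame sampling used by the `frames` helper above
```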
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use & Notes
|
|
|
|
|
- **Use cases:** video question answering and video captioning/understanding in scenarios where the token budget is tight.
|
|
- **Strengths:** **extreme token reduction** (~0.07% of dense tokens) with competitive performance; supports both fixed and adaptive token budgets.
|
|
- **Out-of-scope / caveats:** the model may hallucinate or be brittle on out-of-distribution content; always validate on your own task.
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation
|
|
|
|
|
We evaluate with **lmms-eval** for consistent, reproducible benchmarking. See the repo logs and the paper for details on the datasets, metrics, and token budgets (fixed vs. adaptive).
|
|
|
|
|
--- |
|
|
|
|
|
## Resources
|
|
|
|
|
- **Paper (arXiv):** https://arxiv.org/pdf/2503.16980 |
|
|
- **Project Page:** https://www.zhanghaichao.xyz/VQToken/ |
|
|
- **Code:** https://github.com/Hai-chao-Zhang/VQToken |
|
|
- **Dataset:** https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K |
|
|
- **Test Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh |
|
|
- **Train Script:** https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation
|
|
|
|
|
```bibtex |
|
|
@inproceedings{zhang2025vqtoken, |
|
|
title = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models}, |
|
|
author = {Haichao Zhang and Yun Fu}, |
|
|
booktitle = {NeurIPS}, |
|
|
year = {2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements
|
|
|
|
|
Thanks to the **LLaVA-OneVision / LLaVA-NeXT** and **lmms-eval** communities for their open tooling and baselines.
|
|
|