---
license: cc-by-sa-4.0
library_name: llava
datasets:
  - lmms-lab/LLaVA-Video-178K
language:
  - en
metrics:
  - accuracy
base_model:
  - lmms-lab/llava-onevision-qwen2-0.5b-ov
pipeline_tag: video-text-to-text
tags:
  - token-reduction
  - video-understanding
  - video-llm
  - video-qa
  - extreme-token-reduction
  - llava
  - onevision
  - vqtoken
  - discrete-tokens
  - compression
  - lmms-eval
  - efficiency
---

VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)

ArXiv · Website · GitHub

VQToken Teaser

VQToken is a neural discrete token representation for video that enables extreme token reduction (~0.07% of dense tokens) while retaining strong downstream performance.
This repository hosts the 0.5B VQToken-enabled LLaVA-OneVision checkpoint.
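
For intuition: if a dense encoding used, say, 32 frames × 729 patch tokens per frame (≈ 23K visual tokens), a ~0.07% budget would amount to roughly 16 discrete tokens for the entire clip. These frame and patch counts are illustrative only; see the paper for the exact budgets and configurations.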


🧠 Model Summary

  • Base backbone: LLaVA-OneVision (0.5B)
  • VQToken module: learns discrete video tokens; supports fixed and adaptive token budgets (see the conceptual sketch after this list)
  • Goal: reduce the video token count dramatically while preserving video-LLM accuracy
  • Interface: works with lmms-eval (preferred) and with the modified LLaVA-OneVision loader in the project repo
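
The core idea can be summarized as vector-quantizing dense visual tokens against a learned codebook and keeping only the discrete codes. The sketch below is a minimal, self-contained illustration of that idea, not the repository's actual module; the codebook size, feature dimension, and token counts are assumptions chosen for demonstration.

import torch

def vq_reduce(dense_tokens: torch.Tensor, codebook: torch.Tensor):
    """Assign each dense token to its nearest code and keep only the codes used.

    dense_tokens: (N, D) visual tokens from the vision tower
    codebook:     (K, D) learned discrete codes
    """
    dists = torch.cdist(dense_tokens, codebook)   # (N, K) pairwise L2 distances
    ids = dists.argmin(dim=-1)                    # (N,) discrete code index per dense token
    used = ids.unique()                           # codes actually selected for this clip
    return codebook[used], ids                    # tiny reduced token set + assignments

dense = torch.randn(32 * 729, 896)   # e.g. 32 frames x 729 patches, hidden size 896 (illustrative)
codes = torch.randn(64, 896)         # hypothetical codebook
reduced, assignments = vq_reduce(dense, codes)
print(f"{dense.shape[0]} dense tokens -> {reduced.shape[0]} discrete tokens")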

🏗️ How this checkpoint was trained

The VQToken adapter is integrated into OneVision-0.5B and finetuned on LLaVA-Video-178K (listed in the metadata above). See the training script in the project repo for the full hyperparameters and pipeline details.


🚀 Quick Test (CLI via lmms-eval)

We recommend testing with lmms-eval. The project repo provides a ready-made script; alternatively, run the equivalent command directly:

# env (adjust as needed)
export HF_HOME="/path/to/your/hf/cache"
export HF_TOKEN="your_hf_token_here"
export HF_HUB_ENABLE_HF_TRANSFER=1
# if any eval calls OpenAI endpoints
# export OPENAI_API_KEY="your_openai_key_here"

# Helpful on some single-GPU setups
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \
  -m lmms_eval \
  --model llava_onevision_vqtoken \
  --model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks activitynetqa --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs_vqtoken/

You can swap --tasks for other video QA benchmarks supported by lmms-eval.
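
For example, you can change only the task flag; the exact task names depend on the benchmarks registered in your installed lmms-eval version (videomme and egoschema are shown here only as common examples):

  --tasks videomme,egoschema --batch_size 1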


🧪 Minimal Python Inference

import copy, numpy as np, torch
from decord import VideoReader, cpu
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b"
tok, model, imgproc, _ = load_pretrained_model(
    pretrained, None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa", multimodal=True
)
model.eval()

def frames(path, n=16):
    # uniformly sample n frames across the whole video
    vr = VideoReader(path, ctx=cpu(0))
    idx = np.linspace(0, len(vr)-1, n, dtype=int).tolist()
    return vr.get_batch(idx).asnumpy()  # (T,H,W,C)

video = "sample/demo.mp4"
vid = frames(video, 16)
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()
images = [pix]

conv = copy.deepcopy(conv_templates["qwen_1_5"])
q = f"{DEFAULT_IMAGE_TOKEN}\\nDescribe what's happening in this video."
conv.append_message(conv.roles[0], q); conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
sizes = [f.shape[:2] for f in vid]  # (H, W) per sampled frame

with torch.no_grad():
    out = model.generate(
        ids, images=images, image_sizes=sizes,
        do_sample=False, temperature=0, max_new_tokens=512,
        modalities=["video"], vis=True
    )

print(tok.batch_decode(out, skip_special_tokens=True)[0])
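
If you want to ask several questions about the same preprocessed clip, you can package the calls above into a small helper (same APIs as the snippet above; the example question is arbitrary):

def ask(question: str) -> str:
    conv = copy.deepcopy(conv_templates["qwen_1_5"])
    conv.append_message(conv.roles[0], f"{DEFAULT_IMAGE_TOKEN}\n{question}")
    conv.append_message(conv.roles[1], None)
    ids = tokenizer_image_token(conv.get_prompt(), tok, IMAGE_TOKEN_INDEX,
                                return_tensors="pt").unsqueeze(0).cuda()
    with torch.no_grad():
        out = model.generate(ids, images=images, image_sizes=sizes,
                             do_sample=False, temperature=0, max_new_tokens=256,
                             modalities=["video"], vis=True)
    return tok.batch_decode(out, skip_special_tokens=True)[0]

print(ask("How many people appear in the video?"))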

📦 Intended Use & Notes

  • Use cases: video question answering and video captioning/understanding in settings where the visual token budget is tight.
  • Strengths: extreme token reduction (~0.07% of dense video tokens) with competitive accuracy; both fixed and adaptive token-budget regimes.
  • Out-of-scope / caveats: the model may hallucinate or be brittle on out-of-distribution content; always validate on your own task.

📊 Evaluation

We evaluate through lmms-eval for consistent, reproducible benchmarking. See the repo's logs and the paper for details on datasets, metrics, and token budgets (fixed vs. adaptive).


🔗 Resources


📚 Citation

@inproceedings{zhang2025vqtoken,
  title     = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models},
  author    = {Haichao Zhang and Yun Fu},
  booktitle = {NeurIPS},
  year      = {2025}
}

🙏 Acknowledgements

Thanks to the LLaVA-OneVision / LLaVA-NeXT and lmms-eval communities for their open tooling and baselines.