nielsr HF Staff committed on
Commit b73f590 · verified · 1 Parent(s): 0c82cb6

Improve model card: Add pipeline tag, library name, paper link, authors, and sample usage


This PR significantly enhances the model card for `Video-R1/Qwen2.5-VL-7B-COT-SFT` by:

- Adding the `pipeline_tag: video-text-to-text` to ensure proper categorization and discoverability on the Hugging Face Hub.
- Specifying `library_name: transformers`, which enables the "how to use" widget and direct integration with the 🤗 Transformers library, as evidenced by the model's `config.json`.
- Adding an explicit link to the official paper on Hugging Face Papers.
- Including the list of authors for proper attribution.
- Expanding the model description with an "About" section based on the paper's abstract.
- Providing a clear `transformers`-based code snippet for sample inference.

These improvements will make the model more accessible, informative, and user-friendly for the community.

Files changed (1)
  1. README.md +87 -6
README.md CHANGED
@@ -1,15 +1,96 @@
  ---
- license: apache-2.0
  datasets:
  - Video-R1/Video-R1-data
  language:
  - en
- base_model:
- - Qwen/Qwen2.5-7B-Instruct
  ---

- The SFT cold start model trained by the Video-R1-COT-165k dataset.

- This intermediate checkpoint can be used as the base model for RL training on the Video-R1-260k dataset.

- Please refer to: https://github.com/tulerfeng/Video-R1
  ---
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
  datasets:
  - Video-R1/Video-R1-data
  language:
  - en
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
  ---

+ # Video-R1: Reinforcing Video Reasoning in MLLMs
+
+ This repository contains `Video-R1/Qwen2.5-VL-7B-COT-SFT`, the SFT (Supervised Fine-Tuning) cold start model trained using the Video-R1-COT-165k dataset. This intermediate checkpoint serves as the base model for further RL (Reinforcement Learning) training on the Video-R1-260k dataset to produce the final Video-R1 models.
+
+ For more details, please refer to the paper: [Video-R1: Reinforcing Video Reasoning in MLLMs](https://huggingface.co/papers/2503.21776).
+ The full code and additional resources are available on the [GitHub repository](https://github.com/tulerfeng/Video-R1).
+
+ ## About Video-R1
+
+ Video-R1 represents the first systematic exploration of the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs), inspired by the success of DeepSeek-R1. The project addresses key challenges in video reasoning, particularly the lack of temporal modeling and the scarcity of high-quality video-reasoning data.
+
+ To tackle these issues, Video-R1 proposes the T-GRPO algorithm, an extension of GRPO that explicitly encourages models to leverage temporal information in videos for reasoning. It also strategically incorporates high-quality image-reasoning data into the training process. The model was trained on two newly constructed datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data.
+
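+ To make the description above concrete, the following is a rough, unofficial sketch of the temporal-contrast idea behind T-GRPO. It assumes, purely for illustration, that each question is answered by one group of rollouts conditioned on temporally ordered frames and one on shuffled frames, with a small bonus granted to correct ordered-frame answers only when frame order actually helps. The function name and bonus value are hypothetical; see the paper and GitHub repository for the exact formulation.
+
+ ```python
+ def t_grpo_temporal_rewards(correct_ordered, correct_shuffled, temporal_bonus=0.1):
+     """Illustrative T-GRPO-style reward shaping (not the official implementation).
+
+     correct_ordered / correct_shuffled: 0/1 correctness scores for a group of
+     rollouts answered with temporally ordered vs. randomly shuffled frames.
+     Returns shaped rewards for the ordered-frame rollouts.
+     """
+     acc_ordered = sum(correct_ordered) / len(correct_ordered)
+     acc_shuffled = sum(correct_shuffled) / len(correct_shuffled)
+     # Grant the bonus only if ordered frames beat shuffled frames, so the
+     # policy is rewarded for genuinely using temporal information.
+     bonus = temporal_bonus if acc_ordered > acc_shuffled else 0.0
+     return [score + (bonus if score == 1 else 0.0) for score in correct_ordered]
+
+ # Toy example: ordering helps (0.75 > 0.25), so correct answers get +0.1.
+ print(t_grpo_temporal_rewards([1, 0, 1, 1], [0, 0, 1, 0]))  # [1.1, 0.0, 1.1, 1.1]
+ ```
+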
+ Experimental results demonstrate that Video-R1 achieves significant improvements on various video reasoning benchmarks, including VideoMMMU, VSI-Bench, MVBench, and TempCompass. Notably, Video-R1-7B has shown competitive performance, even surpassing proprietary models like GPT-4o on certain video spatial reasoning tasks.
+
+ ## Authors
+
+ - Kaituo Feng
+ - Kaixiong Gong
+ - Bohao Li
+ - Zonghao Guo
+ - Yibing Wang
+ - Tianshuo Peng
+ - Benyou Wang
+ - Xiangyu Yue
+
+ ## Sample Usage
+
+ Below is a minimal video inference example for this SFT cold start model with the `transformers` library, following the standard Qwen2.5-VL usage pattern (it additionally relies on the `qwen-vl-utils` helper package for loading video frames).
+
+ ```python
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils
+
+ # Load model and processor (requires a transformers release with Qwen2.5-VL support)
+ model_id = "Video-R1/Qwen2.5-VL-7B-COT-SFT"  # this specific SFT checkpoint
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Build a chat-style request with a video and a text prompt
+ # Replace the path with an actual local video file
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "video": "./examples/video1.mp4", "fps": 1.0},
+             {"type": "text", "text": "Describe this video in detail."},
+         ],
+     }
+ ]
+
+ # Prepare inputs: apply the chat template and extract the video frames
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Generate and decode only the newly generated tokens
+ generated_ids = model.generate(**inputs, max_new_tokens=256)
+ trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
+ ```
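+
+ Because the Video-R1 training data mixes image and video examples, the same chat template should also accept single-image inputs. The variant below is an assumption based on the standard Qwen2.5-VL message format rather than an official example; only the `messages` structure changes, the rest of the snippet stays the same.
+
+ ```python
+ # Hypothetical image-input variant (path and prompt are placeholders)
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": "./examples/image1.png"},
+             {"type": "text", "text": "Answer the question step by step."},
+         ],
+     }
+ ]
+ ```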
+
+ ## Citation
+
+ If you find our work helpful for your research, please consider citing our work:
+
+ ```bibtex
+ @article{feng2025video,
+   title={Video-R1: Reinforcing Video Reasoning in MLLMs},
+   author={Feng, Kaituo and Gong, Kaixiong and Li, Bohao and Guo, Zonghao and Wang, Yibing and Peng, Tianshuo and Wang, Benyou and Yue, Xiangyu},
+   journal={arXiv preprint arXiv:2503.21776},
+   year={2025}
+ }
+ ```