Update README.md

---
language: en
license: apache-2.0
library_name: transformers
tags:
- pytorch
- video
- retrieval
- embedding
- multimodal
- qwen2.5-vl
pipeline_tag: sentence-similarity
datasets:
- Alibaba-NLP/UVRB
- Vividbot/vast-2m-vi
- TempoFunk/webvid-10M
- OpenGVLab/InternVid
metrics:
- recall
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# 🎯 General Video Embedder (GVE)

> **One Embedder for All Video Retrieval Scenarios**
Built on **Qwen2.5-VL** and trained only with LoRA on **13M** collected and synthesized …

1. Loading model

```python
import torch
from transformers import AutoModel, AutoProcessor

model_path = 'Alibaba-NLP/GVE-3B'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto', low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = 'left'
```
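Left padding matters here because the retrieval embedding is read from the last token position; with right padding that slot would hold a pad token. A minimal illustration with plain token-id lists (the ids and pad value are made up for the sketch):

```python
PAD = 0  # hypothetical pad token id

# Two variable-length sequences of made-up token ids.
seqs = [[5, 7], [9, 8, 6]]
max_len = max(len(s) for s in seqs)

# Left padding keeps each sequence's real last token at position -1.
left_padded = [[PAD] * (max_len - len(s)) + s for s in seqs]

# Position -1 now holds the final real token of every sequence,
# which is exactly the position the embedding is pooled from.
last_tokens = [s[-1] for s in left_padded]  # [7, 6]
```

With right padding, the shorter sequence would end in `PAD` and the pooled vector would come from a padding position instead of real content.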
Embeddings are then taken from the model outputs as `F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1)`, i.e. the L2-normalized hidden state of the final token.
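The embedding is the final token's hidden state, L2-normalized. That pooling and normalization step can be sketched on a toy tensor; the shapes below are illustrative assumptions, not the model's real dimensions:

```python
import torch
import torch.nn.functional as F

# Stand-in for outputs['last_hidden_state']: batch of 2 sequences,
# 5 token positions, hidden size 8 (illustrative shapes only).
last_hidden_state = torch.randn(2, 5, 8)

# Take the final token's hidden state and L2-normalize it, mirroring
# F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1).
embedding = F.normalize(last_hidden_state[:, -1, :], p=2, dim=1)

# Each row is now a unit-length vector, so a dot product between two
# such embeddings is their cosine similarity.
```

Because the vectors are unit-normalized, retrieval scores reduce to a plain matrix multiply, e.g. `scores = query_emb @ doc_emb.T`.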
## 📚 Citation

```bibtex
@misc{guo2025gve,
  title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
  author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
  year={2025},
  eprint={2510.27571},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.27571},
}
```
|