Update README.md
README.md (changed)
````diff
@@ -100,25 +100,25 @@ from qwen_vl_utils import process_vision_info
 
 # default: Load the model on the available device(s)
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
+    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
 )
 
 # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
 # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-#     "Qwen/Qwen2.5-VL-72B-Instruct",
+#     "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
 #     torch_dtype=torch.bfloat16,
 #     attn_implementation="flash_attention_2",
 #     device_map="auto",
 # )
 
 # default processor
-processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")
+processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")
 
 # The default range for the number of visual tokens per image in the model is 4-16384.
 # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
 # min_pixels = 256*28*28
 # max_pixels = 1280*28*28
-# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
+# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels)
 
 messages = [
     {
````
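The example above is cut off at the `messages` block. For reference, a minimal end-to-end sketch of how the rest of the flow typically continues with the AWQ checkpoint, assuming the standard Qwen2.5-VL pattern (`process_vision_info`, `apply_chat_template`, `generate`); the image URL is a placeholder, not taken from the original text:

```python
# Sketch only: continues the truncated example above with the usual Qwen2.5-VL flow.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # placeholder image
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and gather image/video inputs from the message list.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
))
```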
````diff
@@ -210,7 +210,7 @@ The model supports a wide range of resolution inputs. By default, it uses the na
 min_pixels = 256 * 28 * 28
 max_pixels = 1280 * 28 * 28
 processor = AutoProcessor.from_pretrained(
-    "Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
+    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
 )
 ```
 
````
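A quick sanity check of the pixel bounds used above: each visual token covers a 28×28 pixel patch, so the `min_pixels`/`max_pixels` values map directly to the 256–1280 token budget mentioned in the README comments (plain arithmetic, no extra assumptions):

```python
# Each visual token corresponds to a 28x28 pixel patch, so the pixel bounds above
# translate into a budget of 256-1280 visual tokens per image.
PATCH_PIXELS = 28 * 28

min_pixels = 256 * PATCH_PIXELS   # 200_704 pixels  -> 256 visual tokens
max_pixels = 1280 * PATCH_PIXELS  # 1_003_520 pixels -> 1280 visual tokens

print(min_pixels // PATCH_PIXELS, max_pixels // PATCH_PIXELS)  # 256 1280
```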
````diff
@@ -279,6 +279,28 @@ However, it should be noted that this method has a significant impact on the per
 At the same time, for long video inputs, since MRoPE itself is more economical with ids, the max_position_embeddings can be directly modified to a larger value, such as 64k.
 
 
+### Benchmark
+#### Performance of Quantized Models
+This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2.5-VL series. Specifically, we report:
+
+- MMMU_VAL (Accuracy)
+- DocVQA_VAL (Accuracy)
+- MMBench_DEV_EN (Accuracy)
+- MathVista_MINI (Accuracy)
+
+We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate all models.
+
+| Model Size | Quantization | MMMU_VAL | DocVQA_VAL | MMBench_DEV_EN | MathVista_MINI |
+| --- | --- | --- | --- | --- | --- |
+| Qwen2.5-VL-72B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct)) | 70.0 | 96.1 | 88.2 | 75.3 |
+| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct-AWQ)) | 69.1 | 96.0 | 87.9 | 73.8 |
+| Qwen2.5-VL-7B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct)) | 58.4 | 94.9 | 84.1 | 67.9 |
+| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct-AWQ)) | 55.6 | 94.6 | 84.2 | 64.7 |
+| Qwen2.5-VL-3B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct)) | 51.7 | 93.0 | 79.8 | 61.4 |
+| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct-AWQ)) | 49.1 | 91.8 | 78.0 | 58.8 |
+
+
+
 
 ## Citation
 
````
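On the `max_position_embeddings` note in the context above: a minimal sketch of raising it at load time, assuming the `AutoConfig` override route in transformers (depending on the library version the field may sit at the top level of the config or under a nested `text_config`; 65536 mirrors the "64k" suggestion):

```python
# Sketch: raising max_position_embeddings for long-video inputs, as suggested above.
from transformers import AutoConfig, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-72B-Instruct-AWQ"
config = AutoConfig.from_pretrained(model_id)

# The text settings may live at the top level or under text_config; set whichever exists.
text_cfg = getattr(config, "text_config", None) or config
text_cfg.max_position_embeddings = 65536  # "64k", up from the shipped default

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)
```

Editing the field directly in the checkpoint's `config.json` before loading has the same effect.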