Update README.md
README.md

In addition to the 14B full-scale model, a lightweight 1.7B version is available for on-device use, making it accessible on personal devices such as smartphones and PCs. VARCO-VISION-2.0 is a powerful open-source AI model built for Korean users and is freely available for a wide range of applications.

## 🚨News🎙️

- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR).
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B).
- 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B).
- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding).

VARCO-VISION-2.0 follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).

## Evaluation

We used [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for evaluation whenever possible and implemented our own evaluation only for benchmarks the toolkit does not support, to ensure fair comparisons with various open-source models.
Please note that for certain benchmarks involving LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible due to variations in the underlying LLM behavior.

### English Benchmark

| Benchmark     | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
| :-----------: | :-----------: | :-------: | :-----------: | :------------------: |
| MMStar        | **68.9**      | *67.2*    | 64.1          | 66.5                 |
| SEEDBench_IMG | 77.5          | *77.7*    | 77.0          | **78.0**             |
| LLaVABench    | 84.4          | **93.0**  | *91.0*        | 90.2                 |
| OCRBench      | 877           | *879*     | **888**       | 869                  |

### Korean Benchmark

| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |

### Text-only Benchmark

| Benchmark  | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
| :--------: | :-----------: | :-------: | :-----------: | :------------------: |
| MMLU       | **78.5**      | *78.4*    | 4.6           | 77.9                 |
| MT-Bench   | *8.93*        | 8.59      | 8.07          | **8.98**             |
| KMMLU      | *51.4*        | 49.3      | 39.6          | **57.5**             |
| KoMT-Bench | 7.01          | **7.91**  | 6.84          | *7.83*               |

### OCR Benchmark

| Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
| :-------: | :-------: | :-----: | :------------------: |
| CORD      | *91.4*    | 77.8    | **97.1**             |
| ICDAR2013 | *92.0*    | 85.0    | **95.7**             |
| ICDAR2015 | *73.7*    | 57.9    | **79.4**             |

## Usage

To use this model, we recommend installing `transformers` version **4.53.1 or higher**. It may work with earlier versions, but **4.53.1 or above** is strongly recommended, especially to ensure optimal performance for the **multi-image feature**.
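
The model- and processor-loading code falls outside the hunks shown in this diff. The snippet below is a minimal sketch rather than the README's exact setup: it assumes the checkpoint loads with the LLaVA-OneVision classes in `transformers` (consistent with the architecture note above) and uses float16 to match the `.to(model.device, torch.float16)` calls in the examples that follow.

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed repo id; the 1.7B and 1.7B-OCR checkpoints linked in the News section
# should load the same way. transformers >= 4.53.1 is recommended (see above).
model_name = "NCSOFT/VARCO-VISION-2.0-14B"

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
```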

```python
# ...
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    # ...
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Keep only the newly generated tokens by stripping the prompt ids.
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
```
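
The decoding step that turns `generate_ids_trimmed` back into text is not shown in the hunks. With a `transformers` processor it is typically a `batch_decode` call, roughly as sketched below; the argument choices are standard usage, not copied from the README.

```python
# Decode only the newly generated tokens; [0] picks the single sequence in this example.
output = processor.batch_decode(
    generate_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(output)
```

The same pattern closes the multi-image, batch, and OCR snippets below; for the batch case, keep the whole decoded list instead of indexing `[0]`.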

```python
# ...
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    # ...
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
```
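
The conversation built for the example above is elided in the hunks shown here. For the multi-image use case highlighted in the Usage note, a conversation in this chat-template format simply carries several image entries in one message; the sketch below is a hypothetical illustration with placeholder images and prompt, not the README's actual example.

```python
# Hypothetical multi-image conversation: image_1 and image_2 are PIL images
# loaded by the user, and the prompt text is a placeholder.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_1},
            {"type": "image", "image": image_2},
            {"type": "text", "text": "What are the differences between these two images?"},
        ],
    },
]
```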

<details>
<summary>Batch inference</summary>

All inputs in a batch must have the same modality structure to ensure correct batch inference: text-only with text-only, single-image with single-image, and multi-image with multi-image (with the same number of images per input).

```python
conversation_1 = [
    {
        "role": "user",
        "content": [
            # ...
        ],
    },
]
conversation_2 = [
    {
        "role": "user",
        "content": [
            # ...
        ],
    },
]
inputs = processor.apply_chat_template(
    [conversation_1, conversation_2],
    add_generation_prompt=True,
    # ...
    padding=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
```

```python
from PIL import Image
image = Image.open("/path/to/image.jpg")
# Image upscaling for OCR performance boost
w, h = image.size
target_size = 2304
if max(w, h) < target_size:
    scaling_factor = target_size / max(w, h)
    new_w = int(w * scaling_factor)
    new_h = int(h * scaling_factor)
    image = image.resize((new_w, new_h))
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": ""},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    # ...
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
```