Update README.md
README.md CHANGED
@@ -30,7 +30,7 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
 ## 🚨News🎙️
 - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR)
 - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B)
-- 📰 2025-07-18:
+- 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
 - 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B)
 - 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding)
@@ -38,7 +38,7 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
 - **Multi-image Understanding**: Newly added support for multi-image inputs enables the model to analyze multiple images simultaneously and make more holistic and context-aware decisions.
 - **Korean Language Specialization**: The model is further specialized for Korean, with a deeper understanding of Korean language, context, and culture. Korean text generation has been significantly improved, resulting in more natural, fluent, and accurate responses.
 - **OCR with Text Localization**: Unlike typical models that only recognize and generate text from images, VARCO-VISION-2.0 can also identify the position of the text and provide bounding boxes around it. This makes it especially useful for document understanding, signage interpretation, and structured visual data.
-- **Enhanced Safety**:
+- **Enhanced Safety**: The model now offers improved handling of harmful or sexually explicit content, ensuring safer and more reliable interactions.
 
 <div align="center">
 <img src="./Gimbap_Example-1-20250709-032708.png" width="100%" />
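The multi-image bullet in the hunk above is the headline change in this feature list, so for orientation, here is a minimal, hypothetical sketch of what a multi-image request looks like in the conversation format the README's inference example uses. It assumes the checkpoint loads through transformers' LLaVA-OneVision classes (the architecture the README names); the model ID is real, but the image files and prompt are placeholders, and none of this code appears in the diff itself.

```python
# Illustrative multi-image sketch (not from the README). Assumes the checkpoint
# loads via transformers' LLaVA-OneVision classes; verify against the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "NCSOFT/VARCO-VISION-2.0-14B"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A single user turn may carry several image entries; the model then reasons
# over all of them together. File names and prompt are placeholders.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("photo_a.jpg")},
            {"type": "image", "image": Image.open("photo_b.jpg")},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(generate_ids[0], skip_special_tokens=True))
```

Placing both images in the same user turn is what lets the model compare them in a single forward pass instead of treating them as separate requests.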
@@ -59,6 +59,16 @@ VARCO-VISION-2.0 follows the architecture of [LLaVA-OneVision](https://arxiv.org
 We used [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for evaluation whenever possible, and conducted our own implementations only for benchmarks not supported by the toolkit, **ensuring fair comparisons** with various open-source models.
 Please note that for certain benchmarks involving LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible due to variations in the underlying LLM behavior.
 
+### Korean Benchmark
+| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
+| :-----------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
+| K-MMBench_DEV | **89.1** | 86.0 | 84.7 | 83.9 | *87.7* |
+| K-MMStar | **64.9** | 29.7 | 49.3 | 56.3 | *63.6* |
+| K-SEED | **78.2** | 73.2 | 75.7 | *76.5* | 77.2 |
+| K-LLaVABench | 80.9 | 86.3 | *94.1* | 83.2 | **96.5** |
+| K-DTCBench | *87.9* | 81.7 | 82.1 | **90.0** | 78.3 |
+| ***AVERAGE*** | *80.2* | 71.4 | 77.2 | 78.0 | **80.7** |
+
 ### English Benchmark
 | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
 | :-------------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
@@ -67,7 +77,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
 | MathVista | **74.4** | *73.7* | 68.1 | 62.4 | 73.3 |
 | OCRBench | 87.7 | *87.9* | **88.8** | 73.8 | 86.9 |
 | AI2D | *86.0* | **86.3** | 84.3 | 81.0 | 85.8 |
-| HallusionBench | *55.9* | **56.8** | 51.9 |
+| HallusionBench | *55.9* | **56.8** | 51.9 | 54.2 | 53.7 |
 | MMVet | **80.5** | 68.4 | *69.7* | 59.4 | 69.4 |
 | SEEDBench_IMG | 77.5 | *77.7* | 77.0 | 76.7 | **78.0** |
 | LLaVABench | 84.4 | **93.0** | *91.0* | 83.2 | 90.2 |
@@ -76,21 +86,13 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
 | ScienceQA_TEST | **98.6** | 95.2 | 89.0 | *95.3* | 93.5 |
 | SEEDBench2_Plus | 70.1 | **72.1** | 70.7 | 69.7 | *71.9* |
 | BLINK | **59.9** | *59.0* | 55.3 | 46.1 | 54.5 |
-| ChartQA_TEST | **87.8** | 79.1 | 80.6 | 79.8 | *84.2* |
 | TextVQA_VAL | 82.2 | *83.0* | **85.4** | 82.0 | 80.4 |
+| ChartQA_TEST | **87.8** | 79.1 | 80.6 | 79.8 | *84.2* |
+| Q-Bench1_VAL | 76.5 | *79.2* | 78.2 | 72.5 | **79.9** |
+| A-Bench_VAL | 76.3 | **79.6** | 75.4 | 74.6 | *79.5* |
 | DocVQA_TEST | 94.1 | *94.9* | **95.7** | 94.4 | 90.9 |
 | InfoVQA_TEST | **83.6** | *82.8* | 82.6 | 78.5 | 80.4 |
-| ***AVERAGE*** | **78.
-
-### Korean Benchmark
-| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
-| :-----------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
-| K-MMBench_DEV | **89.1** | 86.0 | 84.7 | 83.9 | *87.7* |
-| K-MMStar | **64.9** | 29.7 | 49.3 | 56.3 | *63.6* |
-| K-SEED | **78.2** | 73.2 | 75.7 | *76.5* | 77.2 |
-| K-LLaVABench | 80.9 | 86.3 | *94.1* | 83.2 | **96.5** |
-| K-DTCBench | *87.9* | 81.7 | 82.1 | **90.0** | 78.3 |
-| ***AVERAGE*** | *80.2* | 71.4 | 77.2 | 78.0 | **80.7** |
+| ***AVERAGE*** | **78.4** | *77.9* | 76.0 | 72.3 | 77.2 |
 
 ### Cultural Benchmark
 | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
@@ -108,7 +110,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
 | KoMT-Bench | 70.1 | **79.1** | 68.4 | 68.9 | *78.3* |
 | LogicKor | 70.0 | **79.4** | 65.5 | 50.6 | *74.0* |
 
-**Note
+> **Note:** Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.
 
 ### OCR Benchmark
 | Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
@@ -249,7 +251,7 @@ conversation = [
         "role": "user",
         "content": [
             {"type": "image", "image": image},
-            {"type": "text", "text": "
+            {"type": "text", "text": ""},
         ],
     },
 ]
@@ -267,4 +269,4 @@ generate_ids_trimmed = [
 output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
 print(output)
 ```
-</details>
+</details>
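The last two hunks touch only fragments of the README's inference example, so here is a minimal sketch of the generate-and-trim idiom those fragments belong to. The names `inputs`, `generate_ids_trimmed`, and `processor` follow the fragments shown in the diff; the surrounding setup (continuing from a conversation built as in the sketch earlier on this page) is an assumption, not code from the README.

```python
# Hypothetical reconstruction around the diff's decode fragment. `model`,
# `processor`, and `inputs` are assumed to exist, e.g. from the multi-image
# sketch above.
generate_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens from each sequence so that only newly generated
# tokens are decoded; zip() pairs each input row with its output row.
generate_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]

output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
print(output)
```

Trimming per sequence keeps the idiom correct for batched prompts of different lengths, and decoding with `skip_special_tokens=False` preserves any special tokens the model emits in its answer.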