kimyoungjune committed
Commit a6d9af8 · verified · 1 Parent(s): 9dd6057

Update README.md

Files changed (1):
1. README.md +20 -18

README.md CHANGED
@@ -30,7 +30,7 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
  ## 🚨News🎙️
  - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR)
  - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B)
- - 📰 2025-07-18: Updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
+ - 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
  - 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B)
  - 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding)
@@ -38,7 +38,7 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
  - **Multi-image Understanding**: Newly added support for multi-image inputs enables the model to analyze multiple images simultaneously and make more holistic and context-aware decisions.
  - **Korean Language Specialization**: The model is further specialized for Korean, with a deeper understanding of Korean language, context, and culture. Korean text generation has been significantly improved, resulting in more natural, fluent, and accurate responses.
  - **OCR with Text Localization**: Unlike typical models that only recognize and generate text from images, VARCO-VISION-2.0 can also identify the position of the text and provide bounding boxes around it. This makes it especially useful for document understanding, signage interpretation, and structured visual data.
- - **Enhanced Safety**: Improved robustness and filtering to ensure safer handling of harmful or sexually explicit content.
+ - **Enhanced Safety**: The model now offers improved handling of harmful or sexually explicit content, ensuring safer and more reliable interactions.

  <div align="center">
  <img src="./Gimbap_Example-1-20250709-032708.png" width="100%" />
@@ -59,6 +59,16 @@ VARCO-VISION-2.0 follows the architecture of [LLaVA-OneVision](https://arxiv.org
  We used [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for evaluation whenever possible, and conducted our own implementations only for benchmarks not supported by the toolkit, **ensuring fair comparisons** with various open-source models.
  Please note that for certain benchmarks involving LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible due to variations in the underlying LLM behavior.

+ ### Korean Benchmark
+ | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
+ | :-----------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
+ | K-MMBench_DEV | **89.1** | 86.0 | 84.7 | 83.9 | *87.7* |
+ | K-MMStar | **64.9** | 29.7 | 49.3 | 56.3 | *63.6* |
+ | K-SEED | **78.2** | 73.2 | 75.7 | *76.5* | 77.2 |
+ | K-LLaVABench | 80.9 | 86.3 | *94.1* | 83.2 | **96.5** |
+ | K-DTCBench | *87.9* | 81.7 | 82.1 | **90.0** | 78.3 |
+ | ***AVERAGE*** | *80.2* | 71.4 | 77.2 | 78.0 | **80.7** |
+
  ### English Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
  | :-------------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
@@ -67,7 +77,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  | MathVista | **74.4** | *73.7* | 68.1 | 62.4 | 73.3 |
  | OCRBench | 87.7 | *87.9* | **88.8** | 73.8 | 86.9 |
  | AI2D | *86.0* | **86.3** | 84.3 | 81.0 | 85.8 |
- | HallusionBench | *55.9* | **56.8** | 51.9 | 51.9 | 53.7 |
+ | HallusionBench | *55.9* | **56.8** | 51.9 | 54.2 | 53.7 |
  | MMVet | **80.5** | 68.4 | *69.7* | 59.4 | 69.4 |
  | SEEDBench_IMG | 77.5 | *77.7* | 77.0 | 76.7 | **78.0** |
  | LLaVABench | 84.4 | **93.0** | *91.0* | 83.2 | 90.2 |
@@ -76,21 +86,13 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  | ScienceQA_TEST | **98.6** | 95.2 | 89.0 | *95.3* | 93.5 |
  | SEEDBench2_Plus | 70.1 | **72.1** | 70.7 | 69.7 | *71.9* |
  | BLINK | **59.9** | *59.0* | 55.3 | 46.1 | 54.5 |
- | ChartQA_TEST | **87.8** | 79.1 | 80.6 | 79.8 | *84.2* |
  | TextVQA_VAL | 82.2 | *83.0* | **85.4** | 82.0 | 80.4 |
+ | ChartQA_TEST | **87.8** | 79.1 | 80.6 | 79.8 | *84.2* |
+ | Q-Bench1_VAL | 76.5 | *79.2* | 78.2 | 72.5 | **79.9** |
+ | A-Bench_VAL | 76.3 | **79.6** | 75.4 | 74.6 | *79.5* |
  | DocVQA_TEST | 94.1 | *94.9* | **95.7** | 94.4 | 90.9 |
  | InfoVQA_TEST | **83.6** | *82.8* | 82.6 | 78.5 | 80.4 |
- | ***AVERAGE*** | **78.6** | *77.7* | 75.9 | 72.0 | 77.0 |
+ | ***AVERAGE*** | **78.4** | *77.9* | 76.0 | 72.3 | 77.2 |
-
- ### Korean Benchmark
- | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
- | :-----------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
- | K-MMBench_DEV | **89.1** | 86.0 | 84.7 | 83.9 | *87.7* |
- | K-MMStar | **64.9** | 29.7 | 49.3 | 56.3 | *63.6* |
- | K-SEED | **78.2** | 73.2 | 75.7 | *76.5* | 77.2 |
- | K-LLaVABench | 80.9 | 86.3 | *94.1* | 83.2 | **96.5** |
- | K-DTCBench | *87.9* | 81.7 | 82.1 | **90.0** | 78.3 |
- | ***AVERAGE*** | *80.2* | 71.4 | 77.2 | 78.0 | **80.7** |

  ### Cultural Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
@@ -108,7 +110,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  | KoMT-Bench | 70.1 | **79.1** | 68.4 | 68.9 | *78.3* |
  | LogicKor | 70.0 | **79.4** | 65.5 | 50.6 | *74.0* |

- **Note**: Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.
+ > **Note:** Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.

  ### OCR Benchmark
  | Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
@@ -249,7 +251,7 @@ conversation = [
          "role": "user",
          "content": [
              {"type": "image", "image": image},
-             {"type": "text", "text": "<ocr>"},
+             {"type": "text", "text": "<ocr>"},
          ],
      },
  ]
@@ -267,4 +269,4 @@ generate_ids_trimmed = [
  output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
  print(output)
  ```
- </details>
+ </details>
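
The hunks above show only fragments of the README's OCR usage example. Below is a minimal sketch of how those fragments fit together, assuming the LLaVA-OneVision classes the model card is built on (`LlavaOnevisionForConditionalGeneration`, `AutoProcessor`); the image path, dtype, and generation settings are illustrative placeholders, not confirmed by this commit.

```python
# Minimal sketch (assumptions noted above): load the model, send an image with
# the "<ocr>" prompt, and decode the localized OCR output.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_name = "NCSOFT/VARCO-VISION-2.0-14B"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

# Hypothetical local file; any document or signage photo works here.
image = Image.open("sample_document.png").convert("RGB")

# The "<ocr>" text prompt requests OCR with text localization,
# i.e. recognized text plus bounding boxes, per the feature list above.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "<ocr>"},
        ],
    },
]

# Tokenize the chat-formatted prompt together with the image.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)

# Drop the prompt tokens so only the newly generated answer is decoded.
generate_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]

# Keep special tokens: the text-localization markup would otherwise be stripped.
output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
print(output)
```

Decoding with `skip_special_tokens=False` matters here: the decoded string interleaves the recognized text with its bounding-box markup, which is exactly what the `<ocr>` prompt exists to produce.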