Update README.md
README.md CHANGED
@@ -30,7 +30,7 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
 ## 🚨News🎙️
 - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR)
 - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B)
-- 📰 2025-07-18:
+- 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
 - 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B)
 - 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding)
@@ -38,7 +38,7 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
 - **Multi-image Understanding**: Newly added support for multi-image inputs enables the model to analyze multiple images simultaneously and make more holistic and context-aware decisions.
 - **Korean Language Specialization**: The model is further specialized for Korean, with a deeper understanding of Korean language, context, and culture. Korean text generation has been significantly improved, resulting in more natural, fluent, and accurate responses.
 - **OCR with Text Localization**: Unlike typical models that only recognize and generate text from images, VARCO-VISION-2.0 can also identify the position of the text and provide bounding boxes around it. This makes it especially useful for document understanding, signage interpretation, and structured visual data.
-- **Enhanced Safety**:
+- **Enhanced Safety**: The model now offers improved handling of harmful or sexually explicit content, ensuring safer and more reliable interactions.
 
 <div align="center">
 <img src="./Gimbap_Example-1-20250709-032708.png" width="100%" />
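The multi-image bullet in the hunk above is the headline change in this feature list, so for orientation, here is a minimal, hypothetical sketch of what a multi-image request looks like in the conversation format the README's inference example uses. It assumes the checkpoint loads through transformers' LLaVA-OneVision classes (the architecture the README names); the model ID is real, but the image files and prompt are placeholders, and none of this code appears in the diff itself.

```python
# Illustrative multi-image sketch (not from the README). Assumes the checkpoint
# loads via transformers' LLaVA-OneVision classes; verify against the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "NCSOFT/VARCO-VISION-2.0-14B"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A single user turn may carry several image entries; the model then reasons
# over all of them together. File names and prompt are placeholders.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("photo_a.jpg")},
            {"type": "image", "image": Image.open("photo_b.jpg")},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(generate_ids[0], skip_special_tokens=True))
```

Placing both images in the same user turn is what lets the model compare them in a single forward pass instead of treating them as separate requests.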
@@ -59,6 +59,16 @@ VARCO-VISION-2.0 follows the architecture of [LLaVA-OneVision](https://arxiv.org
 We used [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for evaluation whenever possible, and conducted our own implementations only for benchmarks not supported by the toolkit, **ensuring fair comparisons** with various open-source models.
 Please note that for certain benchmarks involving LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible due to variations in the underlying LLM behavior.
 
+### Korean Benchmark
+| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
+| :-----------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
+| K-MMBench_DEV | **89.1** | 86.0 | 84.7 | 83.9 | *87.7* |
+| K-MMStar | **64.9** | 29.7 | 49.3 | 56.3 | *63.6* |
+| K-SEED | **78.2** | 73.2 | 75.7 | *76.5* | 77.2 |
+| K-LLaVABench | 80.9 | 86.3 | *94.1* | 83.2 | **96.5** |
+| K-DTCBench | *87.9* | 81.7 | 82.1 | **90.0** | 78.3 |
+| ***AVERAGE*** | *80.2* | 71.4 | 77.2 | 78.0 | **80.7** |
+
 ### English Benchmark
 | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
 | :-------------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
@@ -67,7 +77,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
 | MathVista | **74.4** | *73.7* | 68.1 | 62.4 | 73.3 |
 | OCRBench | 87.7 | *87.9* | **88.8** | 73.8 | 86.9 |
 | AI2D | *86.0* | **86.3** | 84.3 | 81.0 | 85.8 |
-| HallusionBench | *55.9* | **56.8** | 51.9 |
+| HallusionBench | *55.9* | **56.8** | 51.9 | 54.2 | 53.7 |
 | MMVet | **80.5** | 68.4 | *69.7* | 59.4 | 69.4 |
 | SEEDBench_IMG | 77.5 | *77.7* | 77.0 | 76.7 | **78.0** |
 | LLaVABench | 84.4 | **93.0** | *91.0* | 83.2 | 90.2 |
@@ -76,21 +86,13 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
 | ScienceQA_TEST | **98.6** | 95.2 | 89.0 | *95.3* | 93.5 |
 | SEEDBench2_Plus | 70.1 | **72.1** | 70.7 | 69.7 | *71.9* |
 | BLINK | **59.9** | *59.0* | 55.3 | 46.1 | 54.5 |
-| ChartQA_TEST | **87.8** | 79.1 | 80.6 | 79.8 | *84.2* |
 | TextVQA_VAL | 82.2 | *83.0* | **85.4** | 82.0 | 80.4 |
+| ChartQA_TEST | **87.8** | 79.1 | 80.6 | 79.8 | *84.2* |
+| Q-Bench1_VAL | 76.5 | *79.2* | 78.2 | 72.5 | **79.9** |
+| A-Bench_VAL | 76.3 | **79.6** | 75.4 | 74.6 | *79.5* |
 | DocVQA_TEST | 94.1 | *94.9* | **95.7** | 94.4 | 90.9 |
 | InfoVQA_TEST | **83.6** | *82.8* | 82.6 | 78.5 | 80.4 |
-| ***AVERAGE*** | **78.
-
-### Korean Benchmark
-| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
-| :-----------: | :-----------: | :-------: | :-----------: | :--------------: | :------------------: |
-| K-MMBench_DEV | **89.1** | 86.0 | 84.7 | 83.9 | *87.7* |
-| K-MMStar | **64.9** | 29.7 | 49.3 | 56.3 | *63.6* |
-| K-SEED | **78.2** | 73.2 | 75.7 | *76.5* | 77.2 |
-| K-LLaVABench | 80.9 | 86.3 | *94.1* | 83.2 | **96.5** |
-| K-DTCBench | *87.9* | 81.7 | 82.1 | **90.0** | 78.3 |
-| ***AVERAGE*** | *80.2* | 71.4 | 77.2 | 78.0 | **80.7** |
+| ***AVERAGE*** | **78.4** | *77.9* | 76.0 | 72.3 | 77.2 |
 
 ### Cultural Benchmark
 | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | A.X 4.0 VL Light | VARCO-VISION-2.0-14B |
@@ -108,7 +110,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
 | KoMT-Bench | 70.1 | **79.1** | 68.4 | 68.9 | *78.3* |
 | LogicKor | 70.0 | **79.4** | 65.5 | 50.6 | *74.0* |
 
-**Note
+> **Note:** Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.
 
 ### OCR Benchmark
 | Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
@@ -249,7 +251,7 @@ conversation = [
         "role": "user",
         "content": [
             {"type": "image", "image": image},
-            {"type": "text", "text": "
+            {"type": "text", "text": ""},
         ],
     },
 ]
@@ -267,4 +269,4 @@ generate_ids_trimmed = [
 output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
 print(output)
 ```
-</details>
+</details>
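The last two hunks touch only fragments of the README's inference example, so here is a minimal sketch of the generate-and-trim idiom those fragments belong to. The names `inputs`, `generate_ids_trimmed`, and `processor` follow the fragments shown in the diff; the surrounding setup (continuing from a conversation built as in the sketch earlier on this page) is an assumption, not code from the README.

```python
# Hypothetical reconstruction around the diff's decode fragment. `model`,
# `processor`, and `inputs` are assumed to exist, e.g. from the multi-image
# sketch above.
generate_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens from each sequence so that only newly generated
# tokens are decoded; zip() pairs each input row with its output row.
generate_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]

output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
print(output)
```

Trimming per sequence keeps the idiom correct for batched prompts of different lengths, and decoding with `skip_special_tokens=False` preserves any special tokens the model emits in its answer.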