kimyoungjune committed (verified)
Commit 4d04422 · Parent: 29da236

Update README.md

Files changed (1):
  1. README.md (+12 -24)
README.md CHANGED
@@ -28,8 +28,8 @@ language:
  In addition to the 14B full-scale model, a lightweight 1.7B version is available for on-device use, making it accessible on personal devices such as smartphones and PCs. VARCO-VISION-2.0 is a powerful open-source AI model built for Korean users and is freely available for a wide range of applications.

  ## 🚨News🎙️
- - 👀 We are going to release VARCO-VISION-2.0-1.7B-OCR soon!
- - 👀 We are going to release VARCO-VISION-2.0-1.7B soon!
+ - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR)
+ - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B)
  - 📰 2025-07-18: Updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
  - 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B)
  - 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding)
@@ -56,16 +56,16 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
  VARCO-VISION-2.0 follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).

  ## Evaluation
- We adopted benchmark scores directly from [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) where available, and conducted our own evaluations for benchmarks not included in OpenVLM Leaderboard, comparing results against various open-source models to provide a fair and comprehensive evaluation.
+ We used [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for evaluation whenever possible, and implemented our own evaluations only for benchmarks the toolkit does not support, ensuring fair comparisons with various open-source models.
  Please note that for certain benchmarks involving LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible due to variations in the underlying LLM behavior.

  ### English Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
  | :-----------: | :-----------: | :-------: | :-----------: | :------------------: |
  | MMStar | **68.9** | *67.2* | 64.1 | 66.5 |
- | SEEDBench_IMG | 77.5 | **77.7** | 77.0 | **77.7** |
- | LLaVABench | 84.4 | **93.0** | *91.0* | 88.0 |
- | OCRBench | 877 | *879* | **888** | 860 |
+ | SEEDBench_IMG | 77.5 | *77.7* | 77.0 | **78.0** |
+ | LLaVABench | 84.4 | **93.0** | *91.0* | 90.2 |
+ | OCRBench | 877 | *879* | **888** | 869 |

  ### Korean Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
@@ -84,7 +84,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  ### Text-only Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
  | :--------: | :-----------: | :-------: | :-----------: | :------------------: |
- | MMLU | **78.5** | *78.4* | 4.6 | 77.9 |
+ | MMLU | **78.5** | *78.4* | 4.6 | 77.9 |
  | MT-Bench | *8.93* | 8.59 | 8.07 | **8.98** |
  | KMMLU | *51.4* | 49.3 | 39.6 | **57.5** |
  | KoMT-Bench | 7.01 | **7.91** | 6.84 | *7.83* |
@@ -95,9 +95,9 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  ### OCR Benchmark
  | Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
  | :-------: | :-------: | :-----: | :------------------: |
- | CORD | *91.4* | 77.8 | **93.1** |
- | ICDAR2013 | *92.0* | 85.0 | **93.2** |
- | ICDAR2015 | *73.7* | 57.9 | **82.4** |
+ | CORD | *91.4* | 77.8 | **97.1** |
+ | ICDAR2013 | *92.0* | 85.0 | **95.7** |
+ | ICDAR2015 | *73.7* | 57.9 | **79.4** |

  ## Usage
  To use this model, we recommend installing `transformers` version **4.53.1 or higher**. While it may work with earlier versions, using **4.53.1 or above is strongly recommended**, especially to ensure optimal performance for the **multi-image feature**.
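
For readers of this commit who want to try the updated checkpoint directly, here is a minimal, hedged sketch of the single-image usage pattern that the code hunks below only show in fragments. It assumes the model loads through the LLaVA-OneVision classes that `transformers` >= 4.53.1 provides (matching the architecture note above); the checkpoint id is the repository name, while the image path and prompt are illustrative placeholders rather than text from the README.

```python
# Hedged sketch (not part of the diff): single-image inference with
# LLaVA-OneVision-style classes from transformers >= 4.53.1.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_name = "NCSOFT/VARCO-VISION-2.0-14B"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/image.jpg")},  # placeholder path
            {"type": "text", "text": "Describe this image in one sentence."},  # placeholder prompt
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, mirroring the trimming in the hunks below.
output = processor.decode(
    generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output)
```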
@@ -134,7 +134,6 @@ inputs = processor.apply_chat_template(
      return_dict=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
@@ -157,7 +156,6 @@ conversation = [
          ],
      },
  ]
-
  inputs = processor.apply_chat_template(
      conversation,
      add_generation_prompt=True,
@@ -165,7 +163,6 @@ inputs = processor.apply_chat_template(
      return_dict=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
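
The Usage note above singles out the multi-image feature; below is a hedged sketch of what a multi-image turn could look like with the same `apply_chat_template` pattern. It reuses the `model` and `processor` objects from the loading sketch earlier on this page, and the image paths and question are placeholders rather than README text.

```python
# Hedged sketch: multi-image input, assuming `model` and `processor` are already
# loaded as in the earlier sketch. Paths and prompt are placeholders.
import torch
from PIL import Image

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/first.jpg")},
            {"type": "image", "image": Image.open("/path/to/second.jpg")},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
output = processor.decode(
    generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output)
```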
@@ -177,8 +174,7 @@ print(output)

  <details>
  <summary>Batch inference</summary>
-
- All inputs in a batch must have the same modality structure—for example, text-only with text-only, single-image with single-image, and multi-image inputs with the same number of images—to ensure correct batch inference.
+ All inputs in a batch must have the same modality structure—for example, text-only with text-only, single-image with single-image, and multi-image with multi-image—to ensure correct batch inference.

  ```python
  conversation_1 = [
@@ -190,7 +186,6 @@ conversation_1 = [
          ],
      },
  ]
-
  conversation_2 = [
      {
          "role": "user",
@@ -200,7 +195,6 @@ conversation_2 = [
          ],
      },
  ]
-
  inputs = processor.apply_chat_template(
      [conversation_1, conversation_2],
      add_generation_prompt=True,
@@ -209,7 +203,6 @@ inputs = processor.apply_chat_template(
      padding=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
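
Because the batch-inference example is split across several hunks here, a self-contained, hedged sketch of the pattern described by the note in this commit (every conversation in the batch sharing the same modality structure) follows. The prompts and image paths are placeholders, and `model`/`processor` are assumed to be loaded as in the earlier sketch.

```python
# Hedged sketch: batched inference with two single-image conversations so that
# both batch items share the same modality structure. `model`/`processor` as above.
import torch
from PIL import Image

conversation_1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/first.jpg")},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
conversation_2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/second.jpg")},
            {"type": "text", "text": "What objects are visible here?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    [conversation_1, conversation_2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    padding=True,  # pad the shorter prompt so the batch can be stacked
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Trim each prompt before decoding its completion.
outputs = [
    processor.decode(out_ids[len(in_ids):], skip_special_tokens=True)
    for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
print(outputs)
```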
@@ -224,9 +217,7 @@ print(output)

  ```python
  from PIL import Image
-
  image = Image.open("file:///path/to/image.jpg")
-
  # Image upscaling for OCR performance boost
  w, h = image.size
  target_size = 2304
@@ -235,17 +226,15 @@ if max(w, h) < target_size:
      new_w = int(w * scaling_factor)
      new_h = int(h * scaling_factor)
      image = image.resize((new_w, new_h))
-
  conversation = [
      {
          "role": "user",
          "content": [
              {"type": "image", "image": image},
-             {"type": "text", "text": "<ocr>"},
+             {"type": "text", "text": "<ocr>"},
          ],
      },
  ]
-
  inputs = processor.apply_chat_template(
      conversation,
      add_generation_prompt=True,
@@ -253,7 +242,6 @@ inputs = processor.apply_chat_template(
      return_dict=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
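
The last hunks touch the OCR example, whose upscaling step is only partly visible because the `if` condition and the `scaling_factor` assignment fall outside the changed lines. Below is a hedged reconstruction of that preprocessing step: the `target_size = 2304` threshold and the `<ocr>` prompt come from the visible lines, while the `scaling_factor` definition and the image path are assumptions.

```python
# Hedged sketch of the OCR preprocessing shown in the hunks above: upscale the
# image proportionally when its longer side is below target_size, then prompt
# the model with "<ocr>".
from PIL import Image

image = Image.open("/path/to/image.jpg")  # Image.open expects a filesystem path, not a file:// URL

target_size = 2304
w, h = image.size
if max(w, h) < target_size:
    scaling_factor = target_size / max(w, h)  # assumed definition; not shown in the diff
    new_w = int(w * scaling_factor)
    new_h = int(h * scaling_factor)
    image = image.resize((new_w, new_h))

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "<ocr>"},
        ],
    },
]
```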
 