kimyoungjune committed (verified)
Commit 4d04422 · Parent: 29da236

Update README.md

Files changed (1):
  1. README.md (+12 -24)
README.md CHANGED
@@ -28,8 +28,8 @@ language:
  In addition to the 14B full-scale model, a lightweight 1.7B version is available for on-device use, making it accessible on personal devices such as smartphones and PCs. VARCO-VISION-2.0 is a powerful open-source AI model built for Korean users and is freely available for a wide range of applications.

  ## 🚨News🎙️
- - 👀 We are going to release VARCO-VISION-2.0-1.7B-OCR soon!
- - 👀 We are going to release VARCO-VISION-2.0-1.7B soon!
+ - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B-OCR)
+ - 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-1.7B)
  - 📰 2025-07-18: Updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
  - 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B)
  - 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding)
@@ -56,16 +56,16 @@ In addition to the 14B full-scale model, a lightweight 1.7B version is available
  VARCO-VISION-2.0 follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).

  ## Evaluation
- We adopted benchmark scores directly from [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) where available, and conducted our own evaluations for benchmarks not included in OpenVLM Leaderboard, comparing results against various open-source models to provide a fair and comprehensive evaluation.
+ We used [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for evaluation whenever possible, and implemented our own evaluations only for benchmarks the toolkit does not support, ensuring fair comparisons with various open-source models.
  Please note that for certain benchmarks involving LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible due to variations in the underlying LLM behavior.

  ### English Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
  | :-----------: | :-----------: | :-------: | :-----------: | :------------------: |
  | MMStar | **68.9** | *67.2* | 64.1 | 66.5 |
- | SEEDBench_IMG | 77.5 | **77.7** | 77.0 | **77.7** |
- | LLaVABench | 84.4 | **93.0** | *91.0* | 88.0 |
- | OCRBench | 877 | *879* | **888** | 860 |
+ | SEEDBench_IMG | 77.5 | *77.7* | 77.0 | **78.0** |
+ | LLaVABench | 84.4 | **93.0** | *91.0* | 90.2 |
+ | OCRBench | 877 | *879* | **888** | 869 |

  ### Korean Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
@@ -84,7 +84,7 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  ### Text-only Benchmark
  | Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
  | :--------: | :-----------: | :-------: | :-----------: | :------------------: |
- | MMLU | **78.5** | *78.4* | 4.6 | 77.9 |
+ | MMLU | **78.5** | *78.4* | 4.6 | 77.9 |
  | MT-Bench | *8.93* | 8.59 | 8.07 | **8.98** |
  | KMMLU | *51.4* | 49.3 | 39.6 | **57.5** |
  | KoMT-Bench | 7.01 | **7.91** | 6.84 | *7.83* |
@@ -95,9 +95,9 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  ### OCR Benchmark
  | Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
  | :-------: | :-------: | :-----: | :------------------: |
- | CORD | *91.4* | 77.8 | **93.1** |
- | ICDAR2013 | *92.0* | 85.0 | **93.2** |
- | ICDAR2015 | *73.7* | 57.9 | **82.4** |
+ | CORD | *91.4* | 77.8 | **97.1** |
+ | ICDAR2013 | *92.0* | 85.0 | **95.7** |
+ | ICDAR2015 | *73.7* | 57.9 | **79.4** |

  ## Usage
  To use this model, we recommend installing `transformers` version **4.53.1 or higher**. While it may work with earlier versions, using **4.53.1 or above is strongly recommended**, especially to ensure optimal performance for the **multi-image feature**.
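
For readers of this commit who want to try the updated checkpoint directly, here is a minimal, hedged sketch of the single-image usage pattern that the code hunks below only show in fragments. It assumes the model loads through the LLaVA-OneVision classes that `transformers` >= 4.53.1 provides (matching the architecture note above); the checkpoint id is the repository name, while the image path and prompt are illustrative placeholders rather than text from the README.

```python
# Hedged sketch (not part of the diff): single-image inference with
# LLaVA-OneVision-style classes from transformers >= 4.53.1.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_name = "NCSOFT/VARCO-VISION-2.0-14B"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/image.jpg")},  # placeholder path
            {"type": "text", "text": "Describe this image in one sentence."},  # placeholder prompt
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, mirroring the trimming in the hunks below.
output = processor.decode(
    generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output)
```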
@@ -134,7 +134,6 @@ inputs = processor.apply_chat_template(
      return_dict=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
@@ -157,7 +156,6 @@ conversation = [
          ],
      },
  ]
-
  inputs = processor.apply_chat_template(
      conversation,
      add_generation_prompt=True,
@@ -165,7 +163,6 @@ inputs = processor.apply_chat_template(
      return_dict=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
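
The Usage note above singles out the multi-image feature; below is a hedged sketch of what a multi-image turn could look like with the same `apply_chat_template` pattern. It reuses the `model` and `processor` objects from the loading sketch earlier on this page, and the image paths and question are placeholders rather than README text.

```python
# Hedged sketch: multi-image input, assuming `model` and `processor` are already
# loaded as in the earlier sketch. Paths and prompt are placeholders.
import torch
from PIL import Image

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/first.jpg")},
            {"type": "image", "image": Image.open("/path/to/second.jpg")},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
output = processor.decode(
    generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output)
```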
@@ -177,8 +174,7 @@ print(output)

  <details>
  <summary>Batch inference</summary>
-
- All inputs in a batch must have the same modality structure—for example, text-only with text-only, single-image with single-image, and multi-image inputs with the same number of images—to ensure correct batch inference.
+ All inputs in a batch must have the same modality structure—for example, text-only with text-only, single-image with single-image, and multi-image with multi-image—to ensure correct batch inference.

  ```python
  conversation_1 = [
@@ -190,7 +186,6 @@ conversation_1 = [
          ],
      },
  ]
-
  conversation_2 = [
      {
          "role": "user",
@@ -200,7 +195,6 @@ conversation_2 = [
          ],
      },
  ]
-
  inputs = processor.apply_chat_template(
      [conversation_1, conversation_2],
      add_generation_prompt=True,
@@ -209,7 +203,6 @@ inputs = processor.apply_chat_template(
      padding=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
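
Because the batch-inference example is split across several hunks here, a self-contained, hedged sketch of the pattern described by the note in this commit (every conversation in the batch sharing the same modality structure) follows. The prompts and image paths are placeholders, and `model`/`processor` are assumed to be loaded as in the earlier sketch.

```python
# Hedged sketch: batched inference with two single-image conversations so that
# both batch items share the same modality structure. `model`/`processor` as above.
import torch
from PIL import Image

conversation_1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/first.jpg")},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
conversation_2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("/path/to/second.jpg")},
            {"type": "text", "text": "What objects are visible here?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    [conversation_1, conversation_2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    padding=True,  # pad the shorter prompt so the batch can be stacked
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Trim each prompt before decoding its completion.
outputs = [
    processor.decode(out_ids[len(in_ids):], skip_special_tokens=True)
    for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
print(outputs)
```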
@@ -224,9 +217,7 @@ print(output)

  ```python
  from PIL import Image
-
  image = Image.open("file:///path/to/image.jpg")
-
  # Image upscaling for OCR performance boost
  w, h = image.size
  target_size = 2304
@@ -235,17 +226,15 @@ if max(w, h) < target_size:
      new_w = int(w * scaling_factor)
      new_h = int(h * scaling_factor)
      image = image.resize((new_w, new_h))
-
  conversation = [
      {
          "role": "user",
          "content": [
              {"type": "image", "image": image},
-             {"type": "text", "text": "<ocr>"},
+             {"type": "text", "text": "<ocr>"},
          ],
      },
  ]
-
  inputs = processor.apply_chat_template(
      conversation,
      add_generation_prompt=True,
@@ -253,7 +242,6 @@ inputs = processor.apply_chat_template(
      return_dict=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)
-
  generate_ids = model.generate(**inputs, max_new_tokens=1024)
  generate_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
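
The last hunks touch the OCR example, whose upscaling step is only partly visible because the `if` condition and the `scaling_factor` assignment fall outside the changed lines. Below is a hedged reconstruction of that preprocessing step: the `target_size = 2304` threshold and the `<ocr>` prompt come from the visible lines, while the `scaling_factor` definition and the image path are assumptions.

```python
# Hedged sketch of the OCR preprocessing shown in the hunks above: upscale the
# image proportionally when its longer side is below target_size, then prompt
# the model with "<ocr>".
from PIL import Image

image = Image.open("/path/to/image.jpg")  # Image.open expects a filesystem path, not a file:// URL

target_size = 2304
w, h = image.size
if max(w, h) < target_size:
    scaling_factor = target_size / max(w, h)  # assumed definition; not shown in the diff
    new_w = int(w * scaling_factor)
    new_h = int(h * scaling_factor)
    image = image.resize((new_w, new_h))

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "<ocr>"},
        ],
    },
]
```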
 