kimyoungjune committed on
Commit 97a7960 · verified · 1 Parent(s): 28ed419

Update README.md

Files changed (1)
  1. README.md +11 -36
README.md CHANGED
@@ -88,11 +88,11 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  **Note**: Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.

  ### OCR Benchmark
- | Benchmark | PaddleOCR | VARCO-VISION-2.0-14B |
- | :-------: | :-------: | :------------------: |
- | CORD | *91.4* | **93.3** |
- | ICDAR2013 | *92.0* | **93.2** |
- | ICDAR2015 | *73.7* | **82.7** |
+ | Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
+ | :-------: | :-------: | :-----: | :------------------: |
+ | CORD | *91.4* | 77.8 | **93.3** |
+ | ICDAR2013 | *92.0* | 85.0 | **93.2** |
+ | ICDAR2015 | *73.7* | 57.9 | **82.7** |

  ## Usage
  To use this model, we recommend installing `transformers` version **4.53.1 or higher**. While it may work with earlier versions, using **4.53.1 or above is strongly recommended**, especially to ensure optimal performance for the **multi-image feature**.
@@ -100,8 +100,6 @@ To use this model, we recommend installing `transformers` version **4.53.1 or hi
  The basic usage is **identical to** [LLaVA-OneVision](https://huggingface.co/docs/transformers/main/en/model_doc/llava_onevision#usage-example):

  ```python
- import requests
- from PIL import Image
  import torch
  from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

@@ -114,49 +112,26 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
  )
  processor = AutoProcessor.from_pretrained(model_name)

- conversation_1 = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
-             {"type": "text", "text": "What is shown in this image?"},
-         ],
-     },
-     {
-         "role": "assistant",
-         "content": [
-             {"type": "text", "text": "There is a red stop sign in the image."},
-         ],
-     },
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
-             {"type": "text", "text": "What about this image? How many cats do you see?"},
-         ],
-     },
- ]
- conversation_2 = [
+ conversation = [
      {
          "role": "user",
          "content": [
              {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
              {"type": "text", "text": "이 이미지에는 무엇이 보이나요?"},
-         ],
+         ],
      },
  ]

  inputs = processor.apply_chat_template(
-     [conversation_1, conversation_2],
+     conversation,
      add_generation_prompt=True,
      tokenize=True,
      return_dict=True,
-     padding=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)

  generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
- outputs = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+ outputs = processor.decode(generate_ids[0], skip_special_tokens=True)
  print(outputs)
  ```
  The following shows the input required for using OCR with text localization, along with the corresponding output:
@@ -168,7 +143,7 @@ conversation = [
      {
          "role": "user",
          "content": [
-             {"type": "text", "text": "<ocr>"},
+             {"type": "text", "text": ""},
              {"type": "image"},
          ],
      },
@@ -196,4 +171,4 @@ conversation = [
  ```
  <div align="center">
    <img src="./ocr.jpg" width="100%" />
- </div>
+ </div>
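For reference, the basic-usage example as it reads after this commit can be run end to end roughly as sketched below. The `model_name` value and the `from_pretrained(...)` arguments sit outside the changed hunks, so the model id `NCSOFT/VARCO-VISION-2.0-14B` and the half-precision loading settings here are assumptions, not values taken from this diff.

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed Hugging Face model id; confirm against the model card.
model_name = "NCSOFT/VARCO-VISION-2.0-14B"

# Assumed loading arguments (not shown in the diff above).
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

# Single conversation: one image (fetched by URL inside the chat template) plus a text prompt.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
            {"type": "text", "text": "이 이미지에는 무엇이 보이나요?"},  # "What can you see in this image?"
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

# Greedy decoding; decode the single returned sequence, as in the updated README.
generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(processor.decode(generate_ids[0], skip_special_tokens=True))
```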
 
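The OCR hunk only shows the shape of the conversation, so the following is a rough sketch of how it might be driven end to end. It assumes the `<ocr>` task prompt shown on the removed side of that hunk, a hypothetical local image path, and that `model` and `processor` are already loaded as in the previous sketch.

```python
import torch
from PIL import Image

# Hypothetical local image containing text to be read and localized.
image = Image.open("path/to/document_image.png")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},  # assumed OCR task prompt
            {"type": "image"},
        ],
    },
]

# Render the chat template to a prompt string, then pair it with the PIL image;
# the {"type": "image"} placeholder above carries no URL, so the image is passed here.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(processor.decode(generate_ids[0], skip_special_tokens=True))
```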