kimyoungjune committed on
Commit 97a7960 · verified · 1 Parent(s): 28ed419

Update README.md

Files changed (1)
  1. README.md +11 -36
README.md CHANGED
@@ -88,11 +88,11 @@ Please note that for certain benchmarks involving LLM-based evaluation (e.g., LL
  **Note**: Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.

  ### OCR Benchmark
- | Benchmark | PaddleOCR | VARCO-VISION-2.0-14B |
- | :-------: | :-------: | :------------------: |
- | CORD | *91.4* | **93.3** |
- | ICDAR2013 | *92.0* | **93.2** |
- | ICDAR2015 | *73.7* | **82.7** |
+ | Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
+ | :-------: | :-------: | :-----: | :------------------: |
+ | CORD | *91.4* | 77.8 | **93.3** |
+ | ICDAR2013 | *92.0* | 85.0 | **93.2** |
+ | ICDAR2015 | *73.7* | 57.9 | **82.7** |

  ## Usage
  To use this model, we recommend installing `transformers` version **4.53.1 or higher**. While it may work with earlier versions, using **4.53.1 or above is strongly recommended**, especially to ensure optimal performance for the **multi-image feature**.
@@ -100,8 +100,6 @@ To use this model, we recommend installing `transformers` version **4.53.1 or hi
  The basic usage is **identical to** [LLaVA-OneVision](https://huggingface.co/docs/transformers/main/en/model_doc/llava_onevision#usage-example):

  ```python
- import requests
- from PIL import Image
  import torch
  from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

@@ -114,49 +112,26 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
  )
  processor = AutoProcessor.from_pretrained(model_name)

- conversation_1 = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
-             {"type": "text", "text": "What is shown in this image?"},
-         ],
-     },
-     {
-         "role": "assistant",
-         "content": [
-             {"type": "text", "text": "There is a red stop sign in the image."},
-         ],
-     },
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
-             {"type": "text", "text": "What about this image? How many cats do you see?"},
-         ],
-     },
- ]
- conversation_2 = [
+ conversation = [
      {
          "role": "user",
          "content": [
              {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
              {"type": "text", "text": "이 이미지에는 무엇이 보이나요?"},
-         ],
+         ],
      },
  ]

  inputs = processor.apply_chat_template(
-     [conversation_1, conversation_2],
+     conversation,
      add_generation_prompt=True,
      tokenize=True,
      return_dict=True,
-     padding=True,
      return_tensors="pt"
  ).to(model.device, torch.float16)

  generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
- outputs = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+ outputs = processor.decode(generate_ids[0], skip_special_tokens=True)
  print(outputs)
  ```
  The following shows the input required for using OCR with text localization, along with the corresponding output:
@@ -168,7 +143,7 @@ conversation = [
      {
          "role": "user",
          "content": [
-             {"type": "text", "text": "<ocr>"},
+             {"type": "text", "text": ""},
              {"type": "image"},
          ],
      },
@@ -196,4 +171,4 @@ conversation = [
  ```
  <div align="center">
    <img src="./ocr.jpg" width="100%" />
- </div>
+ </div>
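For reference, the basic-usage example as it reads after this commit can be run end to end roughly as sketched below. The `model_name` value and the `from_pretrained(...)` arguments sit outside the changed hunks, so the model id `NCSOFT/VARCO-VISION-2.0-14B` and the half-precision loading settings here are assumptions, not values taken from this diff.

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed Hugging Face model id; confirm against the model card.
model_name = "NCSOFT/VARCO-VISION-2.0-14B"

# Assumed loading arguments (not shown in the diff above).
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

# Single conversation: one image (fetched by URL inside the chat template) plus a text prompt.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
            {"type": "text", "text": "이 이미지에는 무엇이 보이나요?"},  # "What can you see in this image?"
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

# Greedy decoding; decode the single returned sequence, as in the updated README.
generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(processor.decode(generate_ids[0], skip_special_tokens=True))
```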
 
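The OCR hunk only shows the shape of the conversation, so the following is a rough sketch of how it might be driven end to end. It assumes the `<ocr>` task prompt shown on the removed side of that hunk, a hypothetical local image path, and that `model` and `processor` are already loaded as in the previous sketch.

```python
import torch
from PIL import Image

# Hypothetical local image containing text to be read and localized.
image = Image.open("path/to/document_image.png")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},  # assumed OCR task prompt
            {"type": "image"},
        ],
    },
]

# Render the chat template to a prompt string, then pair it with the PIL image;
# the {"type": "image"} placeholder above carries no URL, so the image is passed here.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(processor.decode(generate_ids[0], skip_special_tokens=True))
```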