Update README.md (#4)
- Update README.md (9f57da6408d7a06a3d96c3491b069365b0fc8e0f)
Co-authored-by: jujeongho <[email protected]>

README.md CHANGED
@@ -122,7 +122,7 @@ conversation = [
         "role": "user",
         "content": [
             {"type": "image", "url": "https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B/resolve/main/demo.jpg"},
-            {"type": "text", "text": "
+            {"type": "text", "text": "각 박스마다 한 줄씩 색상과 글자를 정확하게 출력해주세요."},
         ],
     },
 ]
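In English, the added prompt reads: "For each box, print the color and the text exactly, one line per box." For reference, a minimal runnable view of the example this hunk edits is sketched below; it assumes `processor` and `model` are already created as in the README's earlier setup, since the diff only changes the prompt string.

```python
# A runnable view of the single-image example this hunk edits.
# Assumption: `processor` and `model` exist, created as in the README's
# earlier setup code (the diff itself only changes the prompt string).
import torch

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B/resolve/main/demo.jpg"},
            # "For each box, print the color and the text exactly, one line per box."
            {"type": "text", "text": "각 박스마다 한 줄씩 색상과 글자를 정확하게 출력해주세요."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the tokens generated after the prompt.
output = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(output)
```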
@@ -175,6 +175,50 @@ print(output)
 ```
 </details>
 
+<details>
+<summary>Batch inference</summary>
+
+All inputs in a batch must have the same modality structure (for example, text-only with text-only, single-image with single-image, and multi-image inputs with the same number of images) to ensure correct batch inference.
+
+```python
+conversation_1 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "file:///path/to/image1.jpg"},
+            {"type": "text", "text": "이미지를 설명해주세요."},
+        ],
+    },
+]
+
+conversation_2 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "file:///path/to/image2.jpg"},
+            {"type": "text", "text": "이 이미지에 표시된 것은 무엇인가요?"},
+        ],
+    },
+]
+
+inputs = processor.apply_chat_template(
+    [conversation_1, conversation_2],
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    padding=True,
+    return_tensors="pt"
+).to(model.device, torch.float16)
+
+generate_ids = model.generate(**inputs, max_new_tokens=1024)
+generate_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
+]
+output = processor.batch_decode(generate_ids_trimmed, skip_special_tokens=True)
+print(output)
+```
+</details>
+
 <details>
 <summary>OCR inference</summary>
 
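The added batch-inference code assumes a `processor` and `model` are already in scope from the README's setup. Below is a self-contained sketch that fills in those pieces; the `AutoProcessor`/`AutoModelForImageTextToText` classes and the left-side padding setting are assumptions (none of them appear in this diff), and the Korean prompts are translated in comments.

```python
# Self-contained sketch of the new "Batch inference" section.
# Assumptions not shown in this diff: the model loads through the standard
# transformers auto classes, and the tokenizer left-pads so that all prompts
# in the batch end at the generation boundary.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "NCSOFT/VARCO-VISION-2.0-14B"
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.padding_side = "left"  # assumed setting for batched generation
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

conversation_1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "text", "text": "이미지를 설명해주세요."},  # "Please describe the image."
        ],
    },
]

conversation_2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "이 이미지에 표시된 것은 무엇인가요?"},  # "What is shown in this image?"
        ],
    },
]

# Both conversations are single-image, so the batch satisfies the
# same-modality rule stated in the new README section.
inputs = processor.apply_chat_template(
    [conversation_1, conversation_2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    padding=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# generate() returns prompt + completion per row; drop the prompt tokens.
generate_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.batch_decode(generate_ids_trimmed, skip_special_tokens=True)
print(output)
```

With left padding, every prompt ends flush against the generation boundary, so trimming `len(in_ids)` tokens from each output row removes exactly the padded prompt and leaves only the completion.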