---
license: cc-by-nc-sa-4.0
datasets:
- turing-motors/LLaVA-Pretrain-JA
- turing-motors/LLaVA-v1.5-Instruct-620K-JA
language:
- ja
base_model:
- openai/clip-vit-large-patch14-336
- sbintuitions/sarashina2.2-1b-instruct-v0.1
pipeline_tag: image-text-to-text
---
# llava-1.5-sarashina2.2-1.7b-instruct Model Card (EN)

Below is the model card of the llava-1.5-sarashina2.2-1.7b-instruct model, which is almost an exact duplicate of the original LLaVA model card that you can find [here](https://huggingface.co/liuhaotian/llava-v1.5-13b).
## Model details
**Model type:** <br>
llava-1.5-sarashina2.2-1.7b-instruct is an open-source chatbot trained by fine-tuning [sbintuitions/sarashina2.2-1b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-1b-instruct-v0.1) on GPT-generated multimodal instruction-following data.
It is an auto-regressive language model, based on the transformer architecture.
- Total params: 1,716,099,840 (1.7B)
  - LLM ([sbintuitions/sarashina2.2-1b-instruct-v0.1](https://huggingface.co/sbintuitions/sarashina2.2-1b-instruct-v0.1)): 1,407,542,528 (1.4B)
  - Projector (2-layer MLP): 5,049,856 (5M)
  - Vision Encoder ([openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)): 303,507,456 (303M)
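
The breakdown above can be reproduced (approximately, depending on the `transformers` version and module layout) by summing parameters per component. A minimal sketch, assuming the standard `LlavaForConditionalGeneration` module names (`vision_tower`, `multi_modal_projector`, `language_model`):

```python
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "turing-motors/llava-1.5-sarashina2.2-1.7b-instruct"
)

# Sum parameter counts per component by matching the module name inside each parameter name.
# Parameters outside these three groups (e.g. an untied lm_head, if present) are not attributed.
groups = {"vision_tower": 0, "multi_modal_projector": 0, "language_model": 0}
total = 0
for name, param in model.named_parameters():
    total += param.numel()
    for key in groups:
        if key in name:
            groups[key] += param.numel()

for key, count in groups.items():
    print(f"{key}: {count:,}")
print(f"total: {total:,}")
```
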
**Model date:** <br>
llava-1.5-sarashina2.2-1.7b-instruct was trained in May 2025.
**Training Settings:** <br>
We trained **only the projector** using the following two datasets; a minimal sketch of this parameter-freezing setup is shown after the list.
> *Note: To preserve the original LLM’s performance, **we skipped Stage 2 (which would have enabled training both the projector and the LLM)** and instead incorporated the data originally intended for Stage 2 into Stage 1.*
- **Stage 1**
- **[LLaVA-Pretrain-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA) (558K)**
- We used data re-captioned with [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).
- **[LLaVA-v1.5-Instruct-620K-JA](https://huggingface.co/datasets/turing-motors/LLaVA-v1.5-Instruct-620K-JA) (522K)**
- We removed OCR-based datasets (ocr\_vqa, textvqa) to avoid potential inconsistencies introduced by translating into Japanese.
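
As a rough illustration of the projector-only setup (not the actual training code: the `multi_modal_projector` name follows the standard `LlavaForConditionalGeneration` layout, and the real training started from the base LLM and vision encoder rather than the released checkpoint):

```python
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "turing-motors/llava-1.5-sarashina2.2-1.7b-instruct",
    torch_dtype=torch.bfloat16,
)

# Freeze the vision encoder and the LLM; leave only the projector trainable.
for name, param in model.named_parameters():
    param.requires_grad = "multi_modal_projector" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # expected: ~5M (the 2-layer MLP projector)
```
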
## How to use the model
First, make sure you have `transformers >= 4.35.3` installed.
The model supports multi-prompt generation. Also make sure to follow the correct prompt template (`USER: xxxASSISTANT: `) and to add the `<image>` token at the position where you want to query the image:
### Using pure `transformers`:
Below is an example script to run generation in `bfloat16` precision on a GPU device:
```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "turing-motors/llava-1.5-sarashina2.2-1.7b-instruct"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to("cuda")

processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

# Define a chat history and use `apply_chat_template` to get the correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "猫は何匹いますか?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
# USER:
# 猫は何匹いますか?ASSISTANT: 画像には2匹の猫がいます。
```
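
Since the model supports multi-prompt generation, several image/prompt pairs can also be batched through the processor in a single call. A minimal sketch reusing `model`, `processor`, and `raw_image` from the script above (the second question and the left-padding setting are illustrative assumptions, not taken from the original card):

```python
# Left padding is generally preferable for batched generation with decoder-only LLMs.
processor.tokenizer.padding_side = "left"

conversations = [
    [{"role": "user", "content": [{"type": "text", "text": "猫は何匹いますか?"}, {"type": "image"}]}],
    [{"role": "user", "content": [{"type": "text", "text": "画像を短く説明して。"}, {"type": "image"}]}],
]
prompts = [processor.apply_chat_template(c, add_generation_prompt=True) for c in conversations]

# One image per prompt; the same image is reused here purely for illustration.
inputs = processor(
    images=[raw_image, raw_image],
    text=prompts,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
for text in processor.batch_decode(generated_ids, skip_special_tokens=True):
    print(text)
```
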
-----------
From `transformers>=4.48`, you can also pass an image URL or a local path in the conversation history and let the chat template handle the rest.
The chat template will load the image for you and return the inputs as `torch.Tensor`s, which you can pass directly to `model.generate()`:
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "画像を非常に短く説明して。"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

output = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(
    output,
    skip_special_tokens=True,
)
print(generated_texts[0])
# USER:
# 画像を非常に短く説明して。ASSISTANT: 画像は、赤い停止標識と、その横にある赤い門を持つ伝統的な中国の門の2つの標識が写っています。
```
## License
This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (CC BY-NC-SA 4.0). Use of the model should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use