---
license: cc-by-nc-sa-4.0
datasets:
- turing-motors/LLaVA-Pretrain-JA
- turing-motors/LLaVA-v1.5-Instruct-620K-JA
language:
- ja
base_model:
- openai/clip-vit-large-patch14-336
- sbintuitions/sarashina2.2-1b-instruct-v0.1
pipeline_tag: image-text-to-text
---
# llava-1.5-sarashina2.2-1.7b-instruct Model Card (EN)

Below is the model card of the llava-1.5-sarashina2.2-1.7b-instruct model, which closely follows the original LLaVA model card.
## Model details

**Model type:**
llava-1.5-sarashina2.2-1.7b-instruct is an open-source chatbot trained by fine-tuning sbintuitions/sarashina2.2-1b-instruct-v0.1 on GPT-generated multimodal instruction-following data.
It is an auto-regressive language model, based on the transformer architecture.
- Total params: 1,716,099,840 (1.7B)
  - LLM (sbintuitions/sarashina2.2-1b-instruct-v0.1): 1,407,542,528 (1.4B)
  - Projector (2-layer MLP): 5,049,856 (5M)
  - Vision Encoder (openai/clip-vit-large-patch14-336): 303,507,456 (303M)
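
If you want to check this breakdown yourself, the following is a minimal sketch, assuming the `vision_tower` / `multi_modal_projector` / `language_model` submodule names used by transformers' LLaVA implementation (they may differ slightly between versions):

```python
# Rough verification sketch (not part of the original model card scripts).
# The submodule names below are an assumption based on transformers' LLaVA layout.
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "turing-motors/llava-1.5-sarashina2.2-1.7b-instruct",
    torch_dtype=torch.bfloat16,
)

buckets = {"vision_tower": 0, "multi_modal_projector": 0, "language_model": 0, "other": 0}
for name, param in model.named_parameters():
    for key in ("vision_tower", "multi_modal_projector", "language_model"):
        if key in name:
            buckets[key] += param.numel()
            break
    else:
        buckets["other"] += param.numel()

for key, n in buckets.items():
    print(f"{key}: {n:,}")
print(f"total: {sum(p.numel() for p in model.parameters()):,}")
```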
**Model date:**
llava-1.5-sarashina2.2-1.7b-instruct was trained in May 2025.
**Training Settings:**

We trained only the projector, using the following two datasets (a sketch of the projector-only setup follows the list below).

Note: To preserve the original LLM's performance, we skipped Stage 2 (which would have trained both the projector and the LLM) and instead folded the data originally intended for Stage 2 into Stage 1.
- Stage 1
  - LLaVA-Pretrain-JA (558K)
    - We used data re-captioned with Qwen/Qwen2.5-VL-7B-Instruct.
  - LLaVA-v1.5-Instruct-620K-JA (522K)
    - We removed the OCR-based datasets (ocr_vqa, textvqa) to avoid potential inconsistencies introduced by translating them into Japanese.
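
As a rough illustration of what "training only the projector" means (this is not the actual training code), one can freeze the vision encoder and the LLM and leave only the projector trainable; the `multi_modal_projector` name check below is an assumption based on transformers' LLaVA implementation:

```python
# Illustrative sketch only: freeze everything except the 2-layer MLP projector.
# This is not the authors' training script; the parameter-name check assumes
# the "multi_modal_projector" naming used by transformers' LLaVA models.
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "turing-motors/llava-1.5-sarashina2.2-1.7b-instruct",
    torch_dtype=torch.bfloat16,
)

for name, param in model.named_parameters():
    param.requires_grad = "multi_modal_projector" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")  # roughly the ~5M projector parameters
```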
## How to use the model
First, make sure you have transformers >= 4.35.3 installed.
The model supports multi-prompt generation. Also make sure to follow the correct prompt template (`USER: xxxASSISTANT: `) and add the token `<image>` at the location where you want to query the image:
### Using pure transformers:
Below is an example script to run generation in bfloat16 precision on a GPU device:
```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "turing-motors/llava-1.5-sarashina2.2-1.7b-instruct"

# Load the model in bfloat16 and move it to the GPU
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to("cuda")

processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

# Define a chat history and use `apply_chat_template` to get the correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "猫は何匹いますか?"},  # "How many cats are there?"
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
# USER:
# 猫は何匹いますか?ASSISTANT: 画像には2匹の猫がいます。
# (Rough gloss: "How many cats are there?" -> "There are two cats in the image.")
```
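
If you prefer to build the prompt string yourself instead of using `apply_chat_template`, a minimal sketch (reusing `model`, `processor`, and `raw_image` from the snippet above; the exact placement of `<image>` and the newline is an assumption that mirrors the template described earlier) looks like this:

```python
# Manual prompt construction (reuses model, processor and raw_image from above).
# Placing <image> before the question is an assumption based on the
# USER: xxxASSISTANT: template described earlier.
prompt = "USER: <image>\n猫は何匹いますか?ASSISTANT: "  # "How many cats are there?"

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```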
From transformers >= 4.48, you can also pass an image URL or a local path in the conversation history and let the chat template handle the rest.
The chat template will load the image for you and return the inputs as torch.Tensor, which you can pass directly to model.generate():
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "画像を非常に短く説明して。"},  # "Describe the image very briefly."
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

output = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(
    output,
    skip_special_tokens=True,
)
print(generated_texts[0])
# USER:
# 画像を非常に短く説明して。ASSISTANT: 画像は、赤い停止標識と、その横にある赤い門を持つ伝統的な中国の門の2つの標識が写っています。
# (Rough gloss: "Describe the image very briefly." -> "The image shows two signs: a red stop sign and, next to it, a traditional Chinese gate with a red gate.")
```
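
Since the model supports multi-prompt generation, you can also batch several image/question pairs in a single call. The sketch below reuses `model` and `processor` from the snippets above; left padding and the batched processor call are standard transformers features, but treat the details as an illustration rather than an official recipe:

```python
# Hedged sketch: batched generation with two image/question pairs
# (reuses model and processor from the snippets above).
import requests
from PIL import Image

urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "https://www.ilankelman.org/stopsigns/australia.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

questions = ["猫は何匹いますか?", "画像を非常に短く説明して。"]  # same questions as above
prompts = [
    processor.apply_chat_template(
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": q}]}],
        add_generation_prompt=True,
    )
    for q in questions
]

# Left padding so that each prompt ends right before the generated tokens.
processor.tokenizer.padding_side = "left"
inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
for text in processor.batch_decode(output, skip_special_tokens=True):
    print(text)
```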
## License

This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (CC BY-NC-SA 4.0). Use of the model should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use
