---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
<div align="center">
<h1>AndesVL-0.6B-Instruct</h1>
<a href='https://arxiv.org/abs/2510.11496'><img src='https://img.shields.io/badge/arXiv-2510.11496-b31b1b.svg'></a>
<a href='https://huggingface.co/OPPOer'><img src='https://img.shields.io/badge/🤗%20HuggingFace-AndesVL-ffd21f.svg'></a>
<a href='https://github.com/OPPO-Mente-Lab/AndesVL_Evaluation'><img src="https://img.shields.io/badge/GitHub-OPPOer-blue.svg?logo=github" alt="GitHub"></a>
</div>

AndesVL is a suite of mobile-optimized Multimodal Large Language Models (MLLMs) with **0.6B to 4B parameters**, built on Qwen3 LLMs and various visual encoders. Designed for efficient edge deployment, it achieves first-tier performance on a wide range of benchmarks, including text-rich tasks, reasoning tasks, Visual Question Answering (VQA), multi-image tasks, multilingual tasks, and GUI tasks. Its "1+N" LoRA architecture and QALFT framework enable efficient task adaptation and model compression, delivering a 6.7x peak decoding speedup and a 1.8 bits-per-weight compression ratio on mobile chips.

Detailed model sizes and components are provided below:

| Model | Total Parameters (B) | Visual Encoder | LLM |
|---|---|---|---|
| **AndesVL-0.6B** | 0.695 | SigLIP2-Base | Qwen3-0.6B |
| AndesVL-1B | 0.927 | AIMv2-Large | Qwen3-0.6B |
| AndesVL-2B | 2.055 | AIMv2-Large | Qwen3-1.7B |
| AndesVL-4B | 4.360 | AIMv2-Large | Qwen3-4B |
# Quick Start
```python
# require transformers>=4.52.4
import torch
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor
model_dir = "OPPOer/AndesVL-0_6B-Instruct"
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(model_dir, trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "描述这张图片。"},  # "Describe this image."
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://i-blog.csdnimg.cn/blog_migrate/2f4c88e71f7eabe46d062d2f1ec77d10.jpeg"  # or a local image path
                },
            },
        ],
    },
]
res = model.chat(messages, tokenizer, image_processor, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(res)
```
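The call above samples with `temperature=0.6`, so outputs vary between runs. Below is a minimal follow-up sketch, assuming the `url` field also accepts a local file path (as the `# or a local image path` comment suggests) and that extra keyword arguments such as `do_sample` are forwarded to generation; greedy decoding gives reproducible output for tasks like reading text from an image.
```python
# Minimal sketch under the assumptions stated above: local image plus greedy decoding.
# "/path/to/local/image.jpg" is a placeholder; replace it with a real file on disk.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What text appears in this image?"},
            {"type": "image_url", "image_url": {"url": "/path/to/local/image.jpg"}},
        ],
    },
]
res = model.chat(messages, tokenizer, image_processor, max_new_tokens=512, do_sample=False)
print(res)
```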
# Citation
If you find our work helpful, please consider citing it:
```bibtex
@misc{jin2025andesvltechnicalreportefficient,
title={AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model},
author={AndesVL Team, OPPO AI Center},
year={2025},
eprint={2510.11496},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.11496},
}
```
# Acknowledgements
We are grateful to the [Qwen](https://huggingface.co/Qwen), [AIMv2](https://huggingface.co/apple/aimv2-large-patch14-224), and [SigLIP 2](https://arxiv.org/abs/2502.14786) projects for their work.