+---
+library_name: transformers
+license: apache-2.0
+language:
+- en
+base_model:
+  - HuggingFaceTB/SmolVLM-Instruct
+tags:
+- openvino
+- int4
+- quantization
+- edge-deployment
+- optimization
+- vision-language-model
+- multimodal
+- smolvlm
+inference: false
+---
+# SmolVLM INT4 OpenVINO
+## 🚀 Optimized Vision-Language Model for Edge Deployment
+This is an INT4-quantized version of [SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct), produced with OpenVINO and designed for efficient multimodal inference on edge devices and CPUs.
+## Model Overview
+- **Base Model:** SmolVLM-Instruct (2.25B parameters)
+- **Quantization:** INT4 via OpenVINO
+- **Model Type:** Vision-Language Model (VLM)
+- **Capabilities:** Image captioning, visual Q&A, multimodal reasoning
+- **Target Hardware:** CPUs, Intel GPUs, NPUs
+- **Use Cases:** On-device multimodal AI, edge vision applications
+## 🔧 Technical Details
+### Quantization Process
+```python
+# Quantized using OpenVINO NNCF
+# INT4 symmetric quantization
+# Applied to both vision encoder and language decoder
+```
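To make "INT4 symmetric quantization" concrete, here is a minimal plain-Python sketch of the idea: one shared scale per group of weights, a zero-point fixed at 0, and integer codes in the range [-8, 7]. It only illustrates the arithmetic; it is not NNCF's implementation, and the function names, the grouping, and the rounding rule are assumptions made for this example.

```python
# Illustrative only: symmetric INT4 quantization of one weight group.
# NOT the NNCF implementation; names and rounding details are assumptions.

def quantize_int4_symmetric(weights):
    """Map a group of floats to integer codes in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: the zero-point is fixed at 0, so only the scale is stored.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the INT4 codes and the scale."""
    return [c * scale for c in codes]


group = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_int4_symmetric(group)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

The rounding error per weight is bounded by half the scale, which is why groups with a large outlier lose more precision than groups with a narrow value range.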
+### Model Architecture
+- Vision Encoder: Shape-optimized SigLIP (INT4)
+- Text Decoder: SmolLM2 (INT4)
+- Visual tokens: 81 per 384×384 patch
+- Supports arbitrary image-text interleaving
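The figure of 81 visual tokens per 384×384 patch gives a quick way to estimate how much of the context window an image consumes. The sketch below is a rough estimate under stated assumptions: the image is tiled into 384×384 patches (partial patches rounded up), and any additional downscaled global image or special separator tokens the processor may add are ignored. `estimate_visual_tokens` is a hypothetical helper, not part of any library.

```python
import math

TOKENS_PER_PATCH = 81   # from the architecture notes above
PATCH_SIDE = 384        # patch edge length in pixels


def estimate_visual_tokens(width, height):
    """Rough token count for an image tiled into 384x384 patches.

    Assumption: ignores any extra global image and special tokens
    that the real processor may insert.
    """
    patches = math.ceil(width / PATCH_SIDE) * math.ceil(height / PATCH_SIDE)
    return patches * TOKENS_PER_PATCH


print(estimate_visual_tokens(384, 384))
print(estimate_visual_tokens(1536, 1536))
```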
+## 📊 Performance (Experimental)
+> ⚠️ **Note:** This is an experimental quantization. Formal benchmarks pending.
+Expected benefits of INT4 quantization:
+- Significantly reduced model size (4-bit weights take roughly a quarter of the space of 16-bit weights)
+- Faster inference on CPU/edge devices
+- Lower memory requirements for multimodal tasks
+- Visual understanding expected to stay close to the base model (not yet measured)
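As a back-of-the-envelope check on the size reduction, the weight footprint can be estimated from the parameter count and the bit width alone. This is a sketch, not a measurement of this repository: it assumes every one of the 2.25B parameters is stored at the given precision and ignores quantization scales and any layers kept at higher precision, so real file sizes will differ. `weight_gigabytes` is a hypothetical helper.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Estimated weight storage in GB (10**9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 2.25e9  # parameter count stated in the model overview

fp16_gb = weight_gigabytes(PARAMS, 16)
int4_gb = weight_gigabytes(PARAMS, 4)
print(f"16-bit: {fp16_gb:.3f} GB, 4-bit: {int4_gb:.3f} GB")
```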
+## 🛠️ How to Use
+### Installation
+```bash
+pip install "optimum[openvino]" transformers pillow requests
+```
+### Basic Usage
+```python
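+# Assumption: if this class does not load the model with your optimum-intel version, try OVModelForVisualCausalLM instead
+from optimum.intel import OVModelForVision2Seq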
+from transformers import AutoProcessor
+from PIL import Image
+import requests
+# Load model and processor
+model_id = "dev-bjoern/smolvlm-int4-ov"
+processor = AutoProcessor.from_pretrained(model_id)
+model = OVModelForVision2Seq.from_pretrained(model_id)
+# Load an image
+url = "https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+# Create conversation
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "What do you see in this image?"}
+        ]
+    }
+]
+# Process and generate
+prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+inputs = processor(text=prompt, images=[image], return_tensors="pt")
+generated_ids = model.generate(**inputs, max_new_tokens=200)
+output = processor.batch_decode(generated_ids, skip_special_tokens=True)
+print(output[0])
+```
+### Multiple Images
+```python
+# Load multiple images (reuses `processor` and `model` from Basic Usage)
+image1 = Image.open("path/to/image1.jpg")
+image2 = Image.open("path/to/image2.jpg")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "image"},
+            {"type": "text", "text": "Compare these two images"}
+        ]
+    }
+]
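+# Rebuild the prompt from the new messages so it contains two image placeholders
+prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+# Process with multiple images and generate
+inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
+generated_ids = model.generate(**inputs, max_new_tokens=200)
+print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])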
+```
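The number of `{"type": "image"}` placeholders in the message must match the number of images passed to the processor. A small helper can keep the two in sync. This is a sketch; `build_messages` is a hypothetical name introduced here and is not part of `transformers` or `optimum`.

```python
def build_messages(num_images, question):
    """Build a single-turn user message with one image placeholder per image."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_messages(2, "Compare these two images")
print(messages)
```

The result can be passed to `processor.apply_chat_template` exactly like the hand-written `messages` list, with `images=[...]` holding the same number of images.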
+## 🎯 Intended Use
+- **Edge AI vision applications**
+- **Local multimodal assistants**
+- **Privacy-focused image analysis**
+- **Resource-constrained deployment**
+- **Real-time visual understanding**
+## ⚡ Optimization Tips
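+1. **Image Resolution:** Pass `size={"longest_edge": N*384}` when loading the processor; N=3 or 4 is a reasonable balance between detail and speed, and a smaller N reduces the number of visual tokens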
+2. **Batch Processing:** Process multiple images together when possible
+3. **CPU Inference:** Leverage OpenVINO runtime optimizations, such as its latency and throughput performance hints
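The resolution tip can be sketched as follows. The `size` dictionary is computed from N, and `load_processor` shows how it might be passed when loading the processor. Both helper names are hypothetical, and passing `size=` through `AutoProcessor.from_pretrained` is an assumption based on the base model's documented usage; it has not been verified against this quantized repository.

```python
PATCH_SIDE = 384  # edge length of one image patch in pixels


def processor_size(n):
    """Return the `size` argument for a longest edge of n patches."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {"longest_edge": n * PATCH_SIDE}


def load_processor(model_id, n):
    """Load the processor with a custom resolution (assumed usage, see lead-in)."""
    from transformers import AutoProcessor  # imported lazily; needs transformers installed
    return AutoProcessor.from_pretrained(model_id, size=processor_size(n))


print(processor_size(3))
print(processor_size(4))
```

For example, `load_processor("dev-bjoern/smolvlm-int4-ov", 3)` would cap the longest image edge at 1152 pixels.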
+## 🧪 Experimental Status
+This is my first experiment with OpenVINO INT4 quantization for vision-language models. Feedback welcome!
+### Known Limitations
+- No formal benchmarks yet
+- Visual quality degradation not measured
+- Optimal quantization settings still being explored
+### Future Improvements
+- [ ] Benchmark on standard VLM tasks
+- [ ] Compare with original model performance
+- [ ] Experiment with mixed precision
+- [ ] Test on various hardware configurations
+## 🤝 Contributing
+Have suggestions or found issues? Please open a discussion!
+## 📚 Resources
+- [Original SmolVLM Model](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct)
+- [SmolVLM Blog Post](https://huggingface.co/blog/smolvlm)
+- [OpenVINO Documentation](https://docs.openvino.ai/)
+- [Optimum Intel Guide](https://huggingface.co/docs/optimum/intel/index)
+## 🙏 Acknowledgments
+- HuggingFace team for SmolVLM
+- Intel OpenVINO team for quantization tools
+- Vision-language model community
+## 📝 Citation
+If you use this model, please cite both works:
+```bibtex
+@misc{smolvlm-int4-ov,
+  author = {Bjoern Bethge},
+  title = {SmolVLM INT4 OpenVINO},
+  year = {2024},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/dev-bjoern/smolvlm-int4-ov}}
+}
+@article{marafioti2025smolvlm,
+  title={SmolVLM: Redefining small and efficient multimodal models},
+  author={Andrés Marafioti and others},
+  journal={arXiv preprint arXiv:2504.05299},
+  year={2025}
+}
+```
+---
+**Status:** 🧪 Experimental | **Model Type:** Vision-Language | **License:** Apache 2.0