---
license: apache-2.0
language:
- en
base_model:
- ibm-granite/granite-vision-3.3-2b
library_name: transformers
---
# granite-vision-3.3-2b-embedding
**Model Summary:**
Granite-vision-3.3-2b-embedding is an efficient embedding model based on [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b). This model is specifically designed for multimodal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
By removing the need for OCR-based text extraction, granite-vision-3.3-2b-embedding can help simplify and accelerate RAG pipelines.
**Evaluations:**
We evaluated granite-vision-3.3-2b-embedding alongside other top ColBERT-style multimodal embedding models in the 1B-4B parameter range on two benchmarks, [ViDoRe V2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34), both of which specifically target complex multimodal document retrieval tasks.
## **NDCG@5 - ViDoRe V2**
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|----------------------------------------|--------------|------------------|-------------|-------------------|---------------------------------|
| ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 65.3 |
| Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 51.2 |
| MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 | 61.5 |
| ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 | 56.6 |
| ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 | 55.7 |
| MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 55.5 |
| Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 | 58.3 |
| **Avg (ViDoRe2)** | **54.5** | **60.6** | **60.2** | **54.0** | **57.7** |
## **NDCG@5 - REAL-MM-RAG**
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|----------------------------------------|--------------|------------------|-------------|-------------------|---------------------------------|
| FinReport | 55 | 66 | 78 | 65 | 73 |
| FinSlides | 68 | 79 | 81 | 55 | 79 |
| TechReport | 78 | 86 | 88 | 83 | 87 |
| TechSlides | 90 | 93 | 92 | 91 | 93 |
| **Avg (REAL-MM-RAG)** | **73** | **81** | **85** | **74** | **83** |
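
For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@5 can be computed for a single query. It assumes a linear gain and that the relevance list covers every judged page for the query; the benchmark implementations may differ in these details, and the example relevance list is hypothetical. The tables above report the metric scaled by 100.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (ranks start at 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query: the single relevant page is retrieved at rank 2.
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 4))  # 0.6309
```

A benchmark score is then typically the mean of this value over all queries in a collection.
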
- **Release Date**: June 11th 2025
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Supported Input Format:** The model currently supports English instructions and images (PNG, JPEG) as input.
**Intended Use:**
The model is intended for enterprise applications that involve retrieval of visual and text data. In particular, it is well-suited for multimodal RAG systems where the knowledge base consists of complex enterprise documents such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever or alongside a text-based retriever.
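
One way to combine this model with a text-based retriever is to fuse the two ranked lists. The sketch below uses reciprocal rank fusion (RRF), which is an assumption made here for illustration and not a method prescribed by this model card; the page identifiers and the constant `k=60` are likewise hypothetical choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of page ids into one list.

    Each page receives 1 / (k + rank) from every list it appears in;
    pages are returned in order of decreasing fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the visual retriever and a text retriever.
visual_ranking = ["p3", "p1", "p2"]
text_ranking = ["p1", "p4", "p3"]
print(reciprocal_rank_fusion([visual_ranking, text_ranking]))
# ['p1', 'p3', 'p4', 'p2']
```

RRF only needs ranks, so it sidesteps the fact that late-interaction scores and text-retriever scores are on different scales.
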
### Usage
```shell
pip install -q torch torchvision torchaudio
pip install transformers==4.50
```
Then run the code:
```python
from io import BytesIO
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
from transformers.utils.import_utils import is_flash_attn_2_available
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map=device,
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None
).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# ─────────────────────────────────────────────
# Inputs: Image + Text
# ─────────────────────────────────────────────
image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
print("\nFetching image...")
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")
text = "A photo of a tiger"
print("Image and text inputs ready.")
# Process both inputs
print("Processing inputs...")
image_inputs = processor.process_images([image])
text_inputs = processor.process_queries([text])
# Move to correct device
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
# ─────────────────────────────────────────────
# Run Inference
# ─────────────────────────────────────────────
with torch.no_grad():
    print("Getting image embedding...")
    img_emb = model(**image_inputs)
    print("Getting text embedding...")
    txt_emb = model(**text_inputs)
# ─────────────────────────────────────────────
# Score the similarity
# ─────────────────────────────────────────────
print("Scoring similarity...")
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
print("\n" + "=" * 50)
print(f"Similarity between image and text: {similarity.item():.4f}")
print("=" * 50)
```
### Use granite-vision-3.3-2b-embedding for MM RAG
For an example of MM-RAG using granite-vision-3.3-2b-embedding refer to [this notebook](https://github.com/ibm-granite/granite-vision-models/blob/main/cookbooks/GraniteVisionEmbedding_MM-RAG_Notebook.ipynb).
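
The retrieval step of such a pipeline reduces to picking the best-scoring pages for each query. The sketch below assumes the similarity scores have already been computed and converted to a nested list with one row per query and one column per page; that layout, the helper name `top_k_pages`, and the page identifiers are hypothetical and not taken from the notebook.

```python
def top_k_pages(score_rows, page_ids, k=3):
    """For each query row, return the k (page_id, score) pairs with the highest score."""
    results = []
    for row in score_rows:
        ranked = sorted(zip(page_ids, row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:k])
    return results

# Hypothetical scores for two queries against three indexed pages.
page_ids = ["report_p1", "report_p2", "slides_p1"]
scores = [
    [12.5, 18.2, 9.1],   # query 0
    [7.0, 6.5, 11.3],    # query 1
]
for query_index, hits in enumerate(top_k_pages(scores, page_ids, k=2)):
    print(query_index, hits)
```

The selected page images would then be passed, together with the query, to a vision-language model for answer generation.
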
**Model Architecture:**
The architecture of granite-vision-3.3-2b-embedding follows the [ColPali](https://arxiv.org/abs/2407.01449) approach and consists of the following components:
(1) Vision-language model: [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b).
(2) Projection layer: a linear layer that projects the hidden dimension of the vision-language model to 128, producing 729 embedding vectors per image.
Scoring is computed using a MaxSim-based late-interaction mechanism.
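
To illustrate MaxSim late interaction, here is a minimal pure-Python sketch on toy 2-dimensional vectors. For each query vector it takes the highest dot product against all page vectors, then sums those maxima. The helper names and toy data are hypothetical; with this model, a page would contribute 729 vectors of dimension 128, and the `processor.score` call shown in the Usage section remains the intended way to score real embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vectors, page_vectors):
    """Sum, over query vectors, of the best dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)

# Toy embeddings: two query token vectors and two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.5, 0.5]]

print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 1.0
```

Because every query vector is matched independently, a page scores highly only if it contains a good match for each part of the query, which is why page_a outranks page_b here.
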
**Training Data:**
Our training data is drawn entirely from DocFM, a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance) reports.
**Infrastructure:**
We train granite-vision-3.3-2b-embedding on IBM's cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
**Ethical Considerations and Limitations:**
The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to bias and fairness, misinformation, and autonomous decision-making. Granite-vision-3.3-2b-embedding is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
Regarding ethics, a latent risk associated with all Large Language Models is their malicious use. We urge the community to use granite-vision-3.3-2b-embedding with ethical intentions and in a responsible way.
**Resources**
- Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
- Real-MM-RAG-Bench paper (ACL 2025) [here](https://arxiv.org/abs/2502.12342)
- ViDoRe V2 paper [here](https://www.arxiv.org/pdf/2505.17166)
- Learn about the latest updates with Granite: https://www.ibm.com/granite
- Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources