ViDoRe V3: a comprehensive evaluation of retrieval for enterprise use-cases
TL;DR
ILLUIN Technology is proud to release the ViDoRe V3 benchmark, designed and developed with contributions from NVIDIA. ViDoRe V3 is our latest benchmark, engineered to set a new industry gold standard for multi-modal, enterprise document retrieval evaluation. It addresses a critical challenge in production RAG systems: retrieving accurate information from complex, visually-rich documents.
ViDoRe V3 improves on existing RAG benchmarks by prioritizing enterprise relevance and rigorous data quality. Instead of relying on clean academic texts, the benchmark draws from 10 challenging, real-world datasets spanning diverse industrial domains, with 8 publicly released and 2 kept private. In addition, while previous benchmarks often rely on synthetically generated data, ViDoRe V3 features human-created and human-verified annotations.
This benchmark contains 26,000 pages and 3,099 queries translated into 6 languages. Each query is linked to retrieval ground truth data created and verified by human annotators: relevant pages, precise bounding box annotations for key elements, and a comprehensive reference answer.
Why we built ViDoRe V3
The document retrieval landscape is ever more diverse. New pipelines built on Vision Language Models (VLMs) are challenging traditional systems built on text retrieval and generation models. With ViDoRe V1 and V2, we took the first steps towards better evaluation of VLM retrievers:
- ViDoRe V1 focused on extractive queries based on single pages,
- ViDoRe V2 extended the benchmark to more open-ended queries.
But both corpora were still small compared to real-world use cases, and both relied heavily on synthetic generation. While these were steps in the right direction, they still left us with a fragmented picture: corpora needed to be larger and more representative of enterprise data, queries needed to be more diverse, end-to-end evaluation was difficult, and human verification was missing.
Our core contributions
To address the limitations of previous benchmarks, we focused on three main pillars:
- Enterprise-relevant corpora: We assembled 10 diverse corpora, each focusing on a distinct, enterprise-relevant domain or task. For each domain, we curated 1,000+ pages from diverse, permissively licensed multimodal documents that mirror real-world enterprise retrieval challenges and complexity. 8 datasets are publicly released and 2 are private to prevent overfitting.
- Human-verified annotations: For each query, we provide human-annotated page-relevancy rankings, bounding boxes, and written answers to enable comprehensive retrieval/RAG evaluation.
- Diverse queries: To systematically identify failure modes, queries span 7 types (e.g., multi-hop, numerical) and 3 formats (question, instruction, keyword). To assess cross-language capabilities, all queries are provided in 6 languages: English, French, Spanish, German, Italian, and Portuguese.
Public datasets
| Dataset | Domain | Corpus Language | Main Modalities | # Pages | # Queries (without translations) |
|---|---|---|---|---|---|
| French Public Company Annual Reports | Finance-FR | French | Text, Tables, Charts | 2,384 | 320 |
| U.S. Public Company Annual Reports | Finance-EN | English | Text, Tables | 2,942 | 309 |
| Computer Science Textbooks | Computer Science | English | Text, Infographics, Tables | 1,360 | 215 |
| HR Reports from EU | HR | English | Text, Tables, Charts | 1,110 | 318 |
| French Governmental Energy Reports | Energy | French | Text, Charts | 2,229 | 308 |
| USAF Technical Orders | Industrial | English | Text, Tables, Infographics, Images | 5,244 | 283 |
| FDA Reports | Pharmaceuticals | English | Text, Charts, Images, Infographics, Tables | 2,313 | 364 |
| French Physics Lectures | Physics | French | Text, Images, Infographics | 1,674 | 302 |
Private datasets
Two datasets will remain private and will be managed by the MTEB team (big thanks to them!) to ensure benchmark integrity and mitigate overfitting. This setup provides a less biased assessment of visual retriever models and a more representative picture of their true capabilities. To avoid revealing too much detail about these datasets, we are only disclosing the domain and the language of the documents.
The two private datasets cover:
- Energy-related regulatory documents (English)
- Telecom-related technical standard documents (English)
Query categories
We designed the ViDoRe V3 queries to reflect the diversity and complexity of real-world retrieval tasks. Each query is formatted as a question, instruction, or keyword and tagged with one or more of 7 query types.
| Query Type | Definition |
|---|---|
| Open-Ended | A query requiring synthesis and explanation of information. The answer must integrate multiple concepts into a coherent narrative rather than citing a single fact. |
| Compare-Contrast | A query requiring identification and articulation of similarities and/or differences between two or more entities, concepts, or topics. |
| Enumerative | A query requesting a complete list of items that meet specific criteria. |
| Numerical | A query expecting a numerical value, obtained either by direct extraction or calculation. |
| Boolean | A query expecting a yes/no answer, potentially requiring reasoning over extracted information. |
| Extractive | A query answerable by directly citing a specific fact or piece of information from the documents. |
| Multi-hop | A query requiring information retrieval from multiple distinct sources or sections, which must then be combined to produce a complete answer. |
To illustrate which combinations of query types tend to occur together and how common each one is, we visualize the distribution and cardinality of all combinations. Single-category queries are most common, but many queries blend multiple types, such as extractive questions requiring numerical comparisons.
We paid particular attention to ensuring queries are challenging for current retrieval systems across all domains. Most queries require information spread across multiple pages, forcing models to extract and synthesize content from entire documents rather than relying on single-page matches.
A hybrid generation process: how to build a challenging benchmark
To build a benchmark that is robust, difficult, and high-quality, we developed a hybrid process that balances human expertise with LLM-driven scaling. Our goal was to create realistic queries, so we began with a page-agnostic approach: as in ViDoRe V2, queries were generated from high-level summaries of document sections rather than from single pages. This prevents the tasks from being overly simple and ensures they mimic real-world user intent. Generation was carried out both through synthetic pipelines (including NVIDIA NeMo Data Designer with Qwen3-235B) for scale and by expert human annotators for nuance and complexity.
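To make this concrete, here is a minimal sketch of the page-agnostic generation step. The helpers `summarize_section` and `write_query` are hypothetical stand-ins for the actual prompts used with NeMo Data Designer and Qwen3-235B, which we do not reproduce here.

```python
from typing import Callable

def generate_section_queries(
    sections: list[list[str]],                      # each section = the pages it spans
    summarize_section: Callable[[list[str]], str],  # LLM call: pages -> high-level summary
    write_query: Callable[[str], str],              # LLM call: summary -> realistic query
) -> list[str]:
    """Page-agnostic query generation: queries never see a single page in isolation."""
    queries = []
    for pages in sections:
        # Summarize the whole section first, so the resulting query reflects
        # user intent rather than the wording of one specific page.
        summary = summarize_section(pages)
        queries.append(write_query(summary))
    return queries
```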
With thousands of generated queries for thousands of corpus pages, finding the correct ground-truth answers required a massive annotation effort. We implemented a multi-stage funnel to scale this process. First, a VLM (Qwen2.5-VL-32B) performed a loose, high-recall filter to rapidly discard clearly irrelevant pages, limiting false negatives and focusing the annotators' effort. Following this pre-filtering, trained human annotators performed the critical work: they identified the truly relevant pages and produced the final, detailed annotations, including page-level relevancy rankings, precise written answers, and ground-truth bounding boxes.
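Schematically, the pre-filtering stage behaves like the sketch below. Here `looks_possibly_relevant` is a hypothetical stand-in for the Qwen2.5-VL-32B relevance prompt; it is deliberately recall-oriented, since human annotators make the final call on every surviving page.

```python
def prefilter_candidate_pages(query, corpus_pages, looks_possibly_relevant):
    """High-recall loose filter: keep any page that *might* be relevant.

    corpus_pages: dict mapping page id -> page image.
    looks_possibly_relevant: VLM judgment call (hypothetical stand-in).
    """
    candidates = []
    for page_id, page_image in corpus_pages.items():
        # Err on the side of keeping pages: a false positive only costs
        # annotator time, whereas a false negative corrupts the ground truth.
        if looks_possibly_relevant(query, page_image):
            candidates.append(page_id)
    return candidates
```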
While perfect ground truth is an elusive goal for any dataset at this scale, we invested heavily in enforcing a multi-layered quality control framework. Our annotators had native-level language proficiency and all passed pre-production validation and pilot gates. Key tasks were completed by multiple annotators to ensure consensus, and data underwent both quality control and audit checks by experienced senior annotators. This layered approach was designed to make the ground truth and the benchmark tasks as reliable and realistic as possible.
As a final quality assurance step, we rigorously filtered the annotations. This involved checking for annotator consensus, performing manual review, and using Qwen2.5-VL-32B to confirm the presence of relevant information across the annotated pages. We then leveraged Qwen2.5-VL-32B one last time to merge the remaining outputs into a single golden answer.
A hard benchmark for current retrieval models
We evaluate a wide range of modern visual retrieval models on our benchmark using the MTEB framework. The results confirm that the benchmark is exceptionally challenging for current methods.
The best-performing models reach roughly 65% NDCG@10 on the English datasets. When multilingual documents and translated queries are introduced, performance degrades significantly, with no model averaging 60% NDCG@10.
A deeper analysis of the results reveals several key patterns:
- Challenges with technical documents: Models struggle significantly when faced with the highly technical documents in our Industrial subset and our private Energy-EN set, particularly with interpreting dense schematics and complex graphs.
- Persistent multilingual challenges: Performance drops considerably on our French documents. For the Physics and Finance-FR splits, no model was able to reach 50% NDCG@10.
- Relative strength in computer science: Models demonstrate higher performance on the Computer Science split. We hypothesize this is a spill-over effect from the massive amount of coding data used to train modern VLMs, making them more knowledgeable about that domain.
The full, detailed evaluation results and a deeper analysis of the dataset's difficulty are available below. Unless stated otherwise, all reported metrics are NDCG@10.
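For readers who want intuition on what the numbers mean, here is a minimal, illustrative sketch of NDCG@10 for a single query. This is one common formulation, not the exact MTEB implementation; the page ids and relevance grades are hypothetical.

```python
import math

def ndcg_at_10(retrieved_ids, relevance, k=10):
    """NDCG@10 for one query.

    retrieved_ids: pages ranked by the retriever, best first.
    relevance: dict mapping page id -> graded relevance (missing ids count as 0).
    """
    # Discounted cumulative gain over the top-k retrieved pages.
    dcg = sum(
        (2 ** relevance.get(doc_id, 0) - 1) / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved_ids[:k])
    )
    # Ideal DCG: the annotated pages in the best possible order.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the retriever ranks one of the two annotated pages first.
print(ndcg_at_10(["p12", "p3", "p45"], {"p12": 2, "p7": 1}))
```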
English evaluation results
| Model | Average | Computer Science (EN) | Energy (EN) | Finance (EN) | Pharmaceuticals (EN) | HR (EN) | Industrial (EN) | Telecom (EN) |
|---|---|---|---|---|---|---|---|---|
| nemo-colembed-3b | 0.656 | 0.778 | 0.534 | 0.695 | 0.669 | 0.649 | 0.570 | 0.694 |
| nemo-colembed-1b | 0.643 | 0.755 | 0.522 | 0.670 | 0.662 | 0.645 | 0.561 | 0.687 |
| jinav4 | 0.639 | 0.742 | 0.524 | 0.661 | 0.652 | 0.646 | 0.559 | 0.687 |
| colnomic-7b | 0.630 | 0.782 | 0.482 | 0.631 | 0.646 | 0.629 | 0.542 | 0.696 |
| colnomic-3b | 0.617 | 0.755 | 0.455 | 0.630 | 0.637 | 0.626 | 0.528 | 0.686 |
| colqwen2.5 | 0.592 | 0.752 | 0.429 | 0.612 | 0.609 | 0.592 | 0.494 | 0.653 |
| nomic-7b (dense) | 0.573 | 0.709 | 0.423 | 0.576 | 0.638 | 0.559 | 0.485 | 0.620 |
| colqwen2 | 0.563 | 0.735 | 0.441 | 0.509 | 0.581 | 0.547 | 0.498 | 0.632 |
| colpali-v1.3 | 0.530 | 0.725 | 0.381 | 0.433 | 0.577 | 0.533 | 0.470 | 0.592 |
| nomic-3b (dense) | 0.517 | 0.621 | 0.372 | 0.533 | 0.592 | 0.519 | 0.411 | 0.572 |
| colmodernvbert | 0.507 | 0.597 | 0.420 | 0.504 | 0.566 | 0.470 | 0.439 | 0.552 |
| colsmol256 | 0.464 | 0.574 | 0.365 | 0.477 | 0.514 | 0.460 | 0.385 | 0.475 |
Multilingual results
| Model | Average | Computer Science (EN) | Physics (FR) | Energy (EN) | Energy (FR) | Finance (EN) | Pharmaceuticals (EN) | HR (EN) | Industrial (EN) | Finance (FR) | Telecom (EN) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| jinav4 | 0.576 | 0.718 | 0.466 | 0.500 | 0.640 | 0.593 | 0.631 | 0.595 | 0.504 | 0.461 | 0.648 |
| colnomic-7b | 0.574 | 0.762 | 0.483 | 0.450 | 0.640 | 0.566 | 0.623 | 0.587 | 0.501 | 0.455 | 0.672 |
| nemo-colembed-3b | 0.573 | 0.752 | 0.451 | 0.491 | 0.621 | 0.609 | 0.637 | 0.587 | 0.471 | 0.438 | 0.670 |
| colnomic-3b | 0.558 | 0.727 | 0.475 | 0.421 | 0.650 | 0.563 | 0.611 | 0.573 | 0.474 | 0.443 | 0.645 |
| nemo-colembed-1b | 0.556 | 0.713 | 0.441 | 0.473 | 0.609 | 0.589 | 0.626 | 0.570 | 0.466 | 0.424 | 0.647 |
| colqwen2.5 | 0.519 | 0.723 | 0.459 | 0.381 | 0.597 | 0.523 | 0.579 | 0.512 | 0.413 | 0.391 | 0.613 |
| binomic-7b | 0.490 | 0.666 | 0.442 | 0.367 | 0.575 | 0.488 | 0.589 | 0.462 | 0.379 | 0.360 | 0.578 |
| colqwen2 | 0.447 | 0.686 | 0.416 | 0.357 | 0.488 | 0.390 | 0.522 | 0.451 | 0.383 | 0.200 | 0.574 |
| binomic-3b | 0.443 | 0.585 | 0.420 | 0.322 | 0.514 | 0.442 | 0.553 | 0.433 | 0.332 | 0.289 | 0.537 |
| colpali-v1.3 | 0.431 | 0.653 | 0.417 | 0.329 | 0.471 | 0.344 | 0.531 | 0.448 | 0.356 | 0.218 | 0.540 |
| colmodernvbert | 0.245 | 0.353 | 0.212 | 0.196 | 0.305 | 0.270 | 0.317 | 0.183 | 0.144 | 0.179 | 0.293 |
| colsmol256 | 0.214 | 0.288 | 0.161 | 0.183 | 0.248 | 0.232 | 0.278 | 0.165 | 0.129 | 0.157 | 0.298 |
Query type difficulty analysis
We break down the score distribution by query type and task for nemo-retriever-colembed-3b. Model performance aligns well with the expected difficulty of each query type: open-ended queries (NDCG@10 = 0.438) and multi-hop queries (0.515) are the hardest, while extractive (0.668) and boolean (0.657) queries are the easiest.
| Query Type | Average | Computer Science (EN) | Physics (FR) | Energy (FR) | Finance (EN) | Pharmaceuticals (EN) | HR (EN) | Industrial (EN) | Finance (FR) |
|---|---|---|---|---|---|---|---|---|---|
| Extractive | 0.668 | 0.777 | 0.526 | 0.767 | 0.661 | 0.744 | 0.723 | 0.663 | 0.547 |
| Boolean | 0.657 | 0.825 | 0.501 | 0.741 | 0.729 | 0.747 | 0.547 | 0.626 | 0.410 |
| Numerical | 0.633 | 0.712 | 0.596 | 0.725 | 0.587 | 0.832 | 0.647 | 0.703 | 0.488 |
| Compare-Contrast | 0.590 | 0.799 | 0.581 | 0.694 | 0.466 | 0.669 | 0.552 | 0.478 | 0.490 |
| Enumerative | 0.546 | 0.712 | 0.307 | 0.549 | 0.675 | 0.667 | 0.562 | 0.347 | 0.397 |
| Multi-hop | 0.515 | 0.710 | 0.415 | 0.359 | 0.597 | 0.701 | 0.603 | 0.446 | 0.183 |
| Open-ended | 0.438 | 0.709 | 0.375 | 0.475 | 0.529 | 0.489 | 0.498 | 0.209 | 0.324 |
Benchmark scope
Coverage of enterprise documents: A key challenge in developing this benchmark was the limited availability of in-domain multi-modal documents. While significant effort was made to curate relevant documents, the corpora may not fully represent proprietary enterprise data in all contexts.
Language coverage: The benchmark is currently limited to French and English documents. While we attempted to curate relevant documents in additional languages, resource constraints prevented broader language coverage. To mitigate this limitation, queries were translated into multiple languages to enable evaluation of cross-lingual tasks.
Annotation quality: Achieving perfect annotation quality is challenging at this scale and task complexity. We implemented a multi-layered quality-control framework, incorporating both advanced LLMs/VLMs and senior human annotators throughout the pipeline, to validate quality and minimize both false positives and false negatives (Type I and Type II errors). Despite this rigorous validation process, some annotation errors may remain in the benchmark.
Usage
Evaluation
Here is a quick script showing how to evaluate colqwen2.5-v0.2 on the new benchmark using MTEB (at the time of writing, ViDoRe V3 is not yet merged into the main branch):
```python
import mteb

# Load the ViDoRe V3 benchmark definition and the model to evaluate.
benchmark = mteb.get_benchmark("ViDoRe(v3)")
model = mteb.get_model("vidore/colqwen2.5-v0.2")

# Run the evaluation on every task in the benchmark.
results = mteb.evaluate(model=model, tasks=benchmark)
```
Sample visualization
Here is a simple script to visualize a query/answer pair, with bounding boxes plotted on the relevant pages.
```python
from datasets import load_dataset
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Load the three components of a ViDoRe V3 dataset: the queries, the qrels
# (relevance judgments with bounding boxes), and the page-image corpus.
dataset_name = "vidore/vidore_v3_industrial"
dataset = {
    "queries": load_dataset(dataset_name, data_dir="queries", split="test"),
    "qrels": load_dataset(dataset_name, data_dir="qrels", split="test"),
    "corpus": load_dataset(dataset_name, data_dir="corpus", split="test"),
}

# Pick a query and print its human-written reference answer.
query_sample = dataset["queries"][8]
print("Query:", query_sample["query"])
print("Answer:", query_sample["answer"])
```

> Query: What type of airflow is required to maintain ultra-clean environments in aerospace operations?
> Answer: Laminar airflow is required to maintain ultra-clean environments in aerospace operations.

```python
# Keep only the relevance judgments associated with this query.
related_qrels = dataset["qrels"].filter(lambda x: x["query_id"] == query_sample["query_id"])

def plot_bbox(image, bboxes):
    """Display a page image with its annotated bounding boxes."""
    _, ax = plt.subplots(figsize=(18, 12))
    ax.imshow(image)
    ax.axis("off")
    for bbox in bboxes:
        rect = patches.Rectangle(
            (bbox["x1"], bbox["y1"]),
            bbox["x2"] - bbox["x1"], bbox["y2"] - bbox["y1"],
            linewidth=2, edgecolor="r", facecolor="none",
        )
        ax.add_patch(rect)
    plt.show()

# Plot every relevant page with its ground-truth bounding boxes.
for qrel in related_qrels:
    plot_bbox(dataset["corpus"][qrel["corpus_id"]]["image"], qrel["bounding_boxes"])
```
Acknowledgements
This work was granted access to the HPC resources of IDRIS (Jean Zay cluster) under the allocation AD011016393 made by GENCI. This project would not have been realized without the commitment of everyone involved, a debt we owe to all our annotators and colleagues.
Thanks also to the MTEB team for their collaboration on hosting and managing the private datasets.
Finally, thank you to those at NVIDIA who contributed to the design and development of this benchmark: Tom Balough, Gabriel Moreira, Bo Liu, Eric Tramel, Mengyao Xu, Radek Osmulski, Erin Potter, and Hannah Brandon, for their invaluable help and advice.
Links
- Dataset Collection: https://hf.co/collections/vidore/vidore-benchmark-v3
- HF Org 🤗: https://huggingface.co/vidore
- Paper 📄: Coming soon
- Codebase 💻: Coming soon