ViDoRe V3: a comprehensive evaluation of retrieval for enterprise use-cases

Community Article · Published November 5, 2025

TL;DR

ILLUIN Technology is proud to release ViDoRe V3, our latest benchmark, designed and developed with contributions from NVIDIA and engineered to set a new industry gold standard for multi-modal, enterprise document retrieval evaluation. It addresses a critical challenge in production RAG systems: retrieving accurate information from complex, visually rich documents.

ViDoRe V3 improves on existing RAG benchmarks by prioritizing enterprise relevance and rigorous data quality. Instead of relying on clean academic texts, the benchmark draws from 10 challenging, real-world datasets spanning diverse industrial domains, with 8 publicly released and 2 kept private. In addition, while previous benchmarks often rely on synthetically generated data, ViDoRe V3 features human-created and human-verified annotations.

This benchmark contains 26,000 pages and 3,099 queries translated into 6 languages. Each query is linked to retrieval ground truth data created and verified by human annotators: relevant pages, precise bounding box annotations for key elements, and a comprehensive reference answer.

[Figure: example query with its human-annotated relevant pages, bounding boxes, and reference answer]

Why we built ViDoRe V3

The document retrieval landscape is more diverse than ever. New pipelines based on Vision-Language Models (VLMs) are challenging traditional systems built on text retrieval and generation models. With ViDoRe V1 and V2, we took the first steps towards better evaluating VLM retrievers:

  • ViDoRe V1 focused on extractive queries based on single pages,
  • ViDoRe V2 extended the benchmark to more open-ended queries.

But the corpora were still small compared to real-world use cases, and both versions relied heavily on synthetic generation. While these were steps in the right direction, previous benchmarks still left us with a fragmented picture: corpora needed to be larger and more representative of enterprise data, queries needed to be more diverse, end-to-end evaluation was difficult, and human verification was missing.

Our core contributions

To address the limitations of previous benchmarks, we focused on 3 main improvement pillars:

  1. Enterprise-relevant Corpora: We assembled 10 diverse corpora, each focusing on a distinct, enterprise-relevant domain or task. For each domain, we curated 1,000+ pages from diverse, permissively licensed multimodal documents that mirror real-world enterprise retrieval challenges and complexity. 8 datasets are publicly released and 2 are private to prevent overfitting.
  2. Human-verified annotations: For each query, we provide human-annotated page-relevancy rankings, bounding boxes, and written answers to enable comprehensive retrieval/RAG evaluation (see the illustrative record after this list).
  3. Diverse queries: To systematically identify failure modes, queries span 7 types (e.g., multi-hop, numerical) and 3 formats (question, instruction, keyword). To assess cross-language capabilities, all queries are provided in 6 languages: English, French, Spanish, German, Italian, and Portuguese.
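
To make the annotation format concrete, here is an illustrative record for a single query. The field names mirror those used in the usage snippets further down in this post (query, answer, query_id, corpus_id, bounding_boxes); the values themselves are invented, and the exact schema may differ slightly:

# Illustrative only: one query (queries split) and one of its relevance
# judgments (qrels split); values are made up for the example.
example_query = {
    "query_id": "industrial_0042",
    "query": "What type of airflow is required to maintain ultra-clean environments?",
    "answer": "Laminar airflow is required to maintain ultra-clean environments.",
}

example_qrel = {
    "query_id": "industrial_0042",
    "corpus_id": 1337,  # index of the relevant page in the corpus split
    "bounding_boxes": [
        {"x1": 120, "y1": 340, "x2": 880, "y2": 520},  # region holding the evidence
    ],
}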

Public datasets

| Dataset | Domain | Corpus Language | Main Modalities | # Pages | # Queries (without translations) |
|---|---|---|---|---|---|
| French Public Company Annual Reports | Finance-FR | French | Text, Table, Charts | 2384 | 320 |
| U.S. Public Company Annual Reports | Finance-EN | English | Text, Table | 2942 | 309 |
| Computer Science Textbooks | Computer Science | English | Text, Infographic, Tables | 1360 | 215 |
| HR Reports from EU | HR | English | Text, Table, Charts | 1110 | 318 |
| French Governmental Energy Reports | Energy | French | Text, Charts | 2229 | 308 |
| USAF Technical Orders | Industrial | English | Text, Tables, Infographics, Images | 5244 | 283 |
| FDA Reports | Pharmaceuticals | English | Text, Charts, Images, Infographic, Tables | 2313 | 364 |
| French Physics Lectures | Physics | French | Text, Images, Infographics | 1674 | 302 |

Private datasets

Two datasets will remain private and will be managed by the MTEB team (big thanks to them!) to ensure benchmark integrity and mitigate overfitting. Evaluating on held-out data should give a less biased assessment of visual retriever models and a more representative picture of their true capabilities. To avoid revealing too much detail about these datasets, we only disclose the domain and the language of the documents.

The two private datasets cover:

  1. Energy-related regulatory documents (English)
  2. Telecom-related technical standard documents (English)

Query categories

[Figure: distribution of query types across the datasets]

We designed the ViDoRe V3 queries to reflect the diversity and complexity of real-world retrieval tasks. Each query is formatted as a question, instruction, or keyword and tagged with one or more of 7 query types.

| Query Type | Definition |
|---|---|
| Open-Ended | A query requiring synthesis and explanation of information. The answer must integrate multiple concepts into a coherent narrative rather than citing a single fact. |
| Compare-Contrast | A query requiring identification and articulation of similarities and/or differences between two or more entities, concepts, or topics. |
| Enumerative | A query requesting a complete list of items that meet specific criteria. |
| Numerical | A query expecting a numerical value, obtained either by direct extraction or calculation. |
| Boolean | A query expecting a yes/no answer, potentially requiring reasoning over extracted information. |
| Extractive | A query answerable by directly citing a specific fact or piece of information from the documents. |
| Multi-hop | A query requiring information retrieval from multiple distinct sources or sections, which must then be combined to produce a complete answer. |

To illustrate which combinations of query types tend to occur together and how common each one is, we visualize the distribution and cardinality of all combinations. Single-category queries are most common, but many queries blend multiple types, such as extractive questions requiring numerical comparisons.

[Figure: UpSet plot of query type combinations and their frequencies]

We paid particular attention to ensuring queries are challenging for current retrieval systems across all domains. Most queries require information spread across multiple pages, forcing models to extract and synthesize content from entire documents rather than relying on single-page matches.

[Figure: ridge plot of the number of annotated pages per query]

A hybrid generation process: how to build a challenging benchmark

[Figure: query generation and annotation pipeline]

To build a benchmark that is robust, difficult, and high-quality, we developed a hybrid process that balances human expertise with LLM-driven scaling. Our goal was to create realistic queries, so we began with a page-agnostic approach: as in ViDoRe V2, queries were generated from high-level summaries of document sections rather than from a single page. This prevents the tasks from being overly simple and ensures they mimic real-world user intent. Generation was carried out both through synthetic pipelines (including NVIDIA NeMo Data Designer with Qwen3-235B) for scale and by expert human annotators for nuance and complexity.
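
As a rough illustration of the page-agnostic idea (not the actual NeMo Data Designer pipeline), the sketch below assumes an OpenAI-compatible endpoint serving a Qwen3 model; the endpoint URL, model identifier, and prompt are placeholders:

from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; the real pipeline combined
# NVIDIA NeMo Data Designer (with Qwen3-235B) and expert human annotators.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def draft_queries(section_summary: str, n_queries: int = 3) -> str:
    """Draft retrieval queries from a high-level section summary instead of a single page."""
    prompt = (
        "You are helping build a document-retrieval benchmark.\n"
        f"Section summary:\n{section_summary}\n\n"
        f"Write {n_queries} realistic user queries answerable from this section. "
        "Mix formats (question, instruction, keyword) and avoid copying sentences verbatim."
    )
    response = client.chat.completions.create(
        model="qwen3-235b",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content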

With thousands of generated queries over thousands of corpus pages, finding the correct ground-truth answers required a massive annotation effort. We implemented a multi-stage funnel to scale this process. First, a VLM (Qwen2.5-VL-32B) performed a high-recall loose filter to rapidly discard clearly irrelevant pages, limiting false negatives and focusing the annotators' effort. Following this pre-filtering, trained human annotators performed the critical work: they identified the truly relevant pages and produced the final, detailed annotations, including page-level relevancy rankings, precise written answers, and ground-truth bounding boxes.
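
A minimal sketch of what such a high-recall loose filter can look like, again assuming an OpenAI-compatible endpoint, this time serving a vision model; the prompt and decision rule are illustrative, not the production ones:

import base64
import io

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

def maybe_relevant(page_image, query: str) -> bool:
    """Loose filter: keep the page unless the VLM is confident it is irrelevant."""
    buffer = io.BytesIO()
    page_image.save(buffer, format="PNG")  # PIL image from the corpus split
    encoded = base64.b64encode(buffer.getvalue()).decode()
    response = client.chat.completions.create(
        model="qwen2.5-vl-32b",  # placeholder model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
                {"type": "text", "text": f"Could this page help answer: '{query}'? Reply YES or NO."},
            ],
        }],
    )
    # Err on the side of keeping pages: false negatives are what we want to avoid.
    return not response.choices[0].message.content.strip().upper().startswith("NO")

# Pages that survive the filter go to trained human annotators for relevancy
# ranking, written answers, and bounding boxes.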

While perfect ground truth is an elusive goal for any dataset at this scale, we invested heavily in a multi-layered quality-control framework. Our annotators had native-level language proficiency and all passed pre-production validation and pilot gates. Key tasks were completed by multiple annotators to ensure consensus, and the data underwent both quality-control and audit checks by experienced senior annotators. This layered approach was designed to make the ground truth and the benchmark tasks as reliable and realistic as possible.

As a final quality assurance step, we rigorously filtered the annotations. This involved checking for annotator consensus, performing manual review, and using Qwen2.5-VL-32B to confirm the presence of relevant information across the annotated pages. We then leveraged Qwen2.5-VL-32B one last time to merge the remaining outputs into one single golden answer.

A hard benchmark for current retrieval models

[Figure: model performance comparison across ViDoRe V1, V2, and V3]

We evaluate a wide range of modern visual retrieval models on our benchmark using the MTEB framework. The results confirm that the benchmark is exceptionally challenging for current methods.

The best-performing models reach roughly 65% NDCG@10 on the English datasets. When multilingual documents and translated queries are introduced, performance degrades significantly, with no model averaging above 60% NDCG@10.

A deeper analysis of the results reveals several key patterns:

  • Challenges with technical documents: Models struggle significantly when faced with the highly technical documents in our Industrial subset and our private Energy-EN set, particularly with interpreting dense schematics and complex graphs.
  • Persistent multilingual challenges: Performance drops considerably on our French documents. For the Physics and Finance-FR splits, no model was able to reach 50% NDCG@10.
  • Relative strength in computer science: Models demonstrate higher performance on the Computer Science split. We hypothesize this is a spill-over effect from the massive amount of coding data used to train modern VLMs, making them more knowledgeable about that domain.

The full, detailed evaluation results and a deeper analysis of the dataset's difficulty are available below. All the metrics reported, unless stated otherwise, are NDCG@10.
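
For readers less familiar with the metric, NDCG@10 rewards placing the most relevant pages near the top of the ranking and normalizes by the best achievable ordering. A minimal, standard (linear-gain) implementation is sketched below; the benchmark's own numbers are computed by MTEB:

import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for a single query.

    ranked_ids: page ids in the order returned by the retriever.
    relevance:  dict mapping page id -> graded relevance (missing ids count as 0).
    """
    # Discounted cumulative gain of the returned ranking.
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal DCG: the best possible ordering of the ground-truth labels.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one highly relevant page retrieved at rank 1, one relevant page missed.
print(ndcg_at_k(["p7", "p2", "p9"], {"p7": 2, "p4": 1}))  # ~0.76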

English evaluation results

| Model | Average | Computer Science EN | Energy-EN | Finance-EN | Pharmaceuticals EN | HR EN | Industrial EN | Telecom EN |
|---|---|---|---|---|---|---|---|---|
| nemo-colembed-3b | 0.656 | 0.778 | 0.534 | 0.695 | 0.669 | 0.649 | 0.570 | 0.694 |
| nemo-colembed-1b | 0.643 | 0.755 | 0.522 | 0.670 | 0.662 | 0.645 | 0.561 | 0.687 |
| jinav4 | 0.639 | 0.742 | 0.524 | 0.661 | 0.652 | 0.646 | 0.559 | 0.687 |
| colnomic-7b | 0.630 | 0.782 | 0.482 | 0.631 | 0.646 | 0.629 | 0.542 | 0.696 |
| colnomic-3b | 0.617 | 0.755 | 0.455 | 0.630 | 0.637 | 0.626 | 0.528 | 0.686 |
| colqwen2.5 | 0.592 | 0.752 | 0.429 | 0.612 | 0.609 | 0.592 | 0.494 | 0.653 |
| nomic-7b (dense) | 0.573 | 0.709 | 0.423 | 0.576 | 0.638 | 0.559 | 0.485 | 0.620 |
| colqwen2 | 0.563 | 0.735 | 0.441 | 0.509 | 0.581 | 0.547 | 0.498 | 0.632 |
| colpali-v1.3 | 0.530 | 0.725 | 0.381 | 0.433 | 0.577 | 0.533 | 0.470 | 0.592 |
| nomic-3b (dense) | 0.517 | 0.621 | 0.372 | 0.533 | 0.592 | 0.519 | 0.411 | 0.572 |
| colmodernvbert | 0.507 | 0.597 | 0.420 | 0.504 | 0.566 | 0.470 | 0.439 | 0.552 |
| colsmol256 | 0.464 | 0.574 | 0.365 | 0.477 | 0.514 | 0.460 | 0.385 | 0.475 |

Multilingual results

| Model | Average | Computer Science EN | Physics FR | Energy-EN | Energy-FR | Finance-EN | Pharmaceuticals EN | HR EN | Industrial EN | Finance-FR | Telecom EN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| jinav4 | 0.576 | 0.718 | 0.466 | 0.500 | 0.640 | 0.593 | 0.631 | 0.595 | 0.504 | 0.461 | 0.648 |
| colnomic-7b | 0.574 | 0.762 | 0.483 | 0.450 | 0.640 | 0.566 | 0.623 | 0.587 | 0.501 | 0.455 | 0.672 |
| nemo-colembed-3b | 0.573 | 0.752 | 0.451 | 0.491 | 0.621 | 0.609 | 0.637 | 0.587 | 0.471 | 0.438 | 0.670 |
| colnomic-3b | 0.558 | 0.727 | 0.475 | 0.421 | 0.650 | 0.563 | 0.611 | 0.573 | 0.474 | 0.443 | 0.645 |
| nemo-colembed-1b | 0.556 | 0.713 | 0.441 | 0.473 | 0.609 | 0.589 | 0.626 | 0.570 | 0.466 | 0.424 | 0.647 |
| colqwen2.5 | 0.519 | 0.723 | 0.459 | 0.381 | 0.597 | 0.523 | 0.579 | 0.512 | 0.413 | 0.391 | 0.613 |
| binomic-7b | 0.490 | 0.666 | 0.442 | 0.367 | 0.575 | 0.488 | 0.589 | 0.462 | 0.379 | 0.360 | 0.578 |
| colqwen2 | 0.447 | 0.686 | 0.416 | 0.357 | 0.488 | 0.390 | 0.522 | 0.451 | 0.383 | 0.200 | 0.574 |
| binomic-3b | 0.443 | 0.585 | 0.420 | 0.322 | 0.514 | 0.442 | 0.553 | 0.433 | 0.332 | 0.289 | 0.537 |
| colpali-v1.3 | 0.431 | 0.653 | 0.417 | 0.329 | 0.471 | 0.344 | 0.531 | 0.448 | 0.356 | 0.218 | 0.540 |
| colmodernvbert | 0.245 | 0.353 | 0.212 | 0.196 | 0.305 | 0.270 | 0.317 | 0.183 | 0.144 | 0.179 | 0.293 |
| colsmol256 | 0.214 | 0.288 | 0.161 | 0.183 | 0.248 | 0.232 | 0.278 | 0.165 | 0.129 | 0.157 | 0.298 |

Query type difficulty analysis

We break down the score distribution by query type and task for nemo-retriever-colembed-3b. Model performance aligns well with the expected difficulty of each query type: open-ended (NDCG@10 = 0.438) and multi-hop queries (0.515) are the hardest to retrieve, while extractive (0.668) and boolean (0.657) are the easiest.

| Query Type | Average | Computer Science EN | Physics FR | Energy-FR | Finance-EN | Pharmaceuticals EN | HR EN | Industrial EN | Finance-FR |
|---|---|---|---|---|---|---|---|---|---|
| Extractive | 0.668 | 0.777 | 0.526 | 0.767 | 0.661 | 0.744 | 0.723 | 0.663 | 0.547 |
| Boolean | 0.657 | 0.825 | 0.501 | 0.741 | 0.729 | 0.747 | 0.547 | 0.626 | 0.410 |
| Numerical | 0.633 | 0.712 | 0.596 | 0.725 | 0.587 | 0.832 | 0.647 | 0.703 | 0.488 |
| Compare-Contrast | 0.590 | 0.799 | 0.581 | 0.694 | 0.466 | 0.669 | 0.552 | 0.478 | 0.490 |
| Enumerative | 0.546 | 0.712 | 0.307 | 0.549 | 0.675 | 0.667 | 0.562 | 0.347 | 0.397 |
| Multi-hop | 0.515 | 0.710 | 0.415 | 0.359 | 0.597 | 0.701 | 0.603 | 0.446 | 0.183 |
| Open-ended | 0.438 | 0.709 | 0.375 | 0.475 | 0.529 | 0.489 | 0.498 | 0.209 | 0.324 |
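
One simple way to produce such a breakdown, assuming you have per-query NDCG@10 scores and each query's type tags (queries can carry several), is sketched below; the exact aggregation used for the table above may differ:

from collections import defaultdict

def scores_by_query_type(per_query_scores, query_types):
    """Average a per-query metric over query-type tags.

    per_query_scores: dict query_id -> NDCG@10 for that query.
    query_types:      dict query_id -> list of type tags for that query.
    """
    buckets = defaultdict(list)
    for query_id, score in per_query_scores.items():
        for tag in query_types.get(query_id, []):
            buckets[tag].append(score)  # multi-tagged queries count in every bucket
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}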

Benchmark scope

Coverage of enterprise documents: A key challenge in developing this benchmark was the limited availability of in-domain multi-modal documents. While significant effort was made to curate relevant documents, the corpora may not fully represent proprietary enterprise data in all contexts.

Language coverage: The benchmark is currently limited to French and English documents. While we attempted to curate relevant documents in additional languages, resource constraints prevented broader language coverage. To mitigate this limitation, queries were translated into multiple languages to enable evaluation of cross-lingual tasks.

Annotation quality: Achieving perfect annotation quality is challenging at this scale and level of task complexity. We implemented a multi-layered quality-control framework incorporating both advanced LLMs/VLMs and senior human annotators throughout the pipeline to validate quality and minimize Type I and Type II errors. Despite this rigorous validation process, some annotation errors may remain in the benchmark.

Usage

Evaluation

Here is a quick script showing how to evaluate colqwen2.5-v0.2 on the new benchmark using MTEB (at the time of writing, the benchmark is not yet merged into MTEB's main branch):

import mteb

# Load the ViDoRe V3 benchmark tasks and the retriever to evaluate.
benchmark = mteb.get_benchmark("ViDoRe(v3)")
model = mteb.get_model("vidore/colqwen2.5-v0.2")

# Run the evaluation across all ViDoRe V3 tasks.
results = mteb.evaluate(model=model, tasks=benchmark)

Sample visualization

Here is a simple script to visualize a query/answer pair, with bounding boxes plotted on the relevant pages.

from datasets import load_dataset
import matplotlib.pyplot as plt
import matplotlib.patches as patches

dataset_name = "vidore/vidore_v3_industrial"

# Load the three splits: queries, query-to-page relevance judgments (qrels), and page images.
dataset = {
    "queries": load_dataset(dataset_name, data_dir="queries", split="test"),
    "qrels": load_dataset(dataset_name, data_dir="qrels", split="test"),
    "corpus": load_dataset(dataset_name, data_dir="corpus", split="test"),
}

# Pick a sample query and print its reference answer.
query_sample = dataset["queries"][8]
print("Query:", query_sample["query"])
print("Answer:", query_sample["answer"])
> Query: What type of airflow is required to maintain ultra-clean environments in aerospace operations?
> Answer: Laminar airflow is required to maintain ultra-clean environments in aerospace operations.

# Keep only the relevance judgments for this query.
related_qrels = dataset["qrels"].filter(lambda x: x["query_id"] == query_sample["query_id"])

def plot_bbox(image, bboxes):
    """Display a page image with its annotated bounding boxes."""
    _, ax = plt.subplots(figsize=(18, 12))
    ax.imshow(image)
    ax.axis("off")
    for bbox in bboxes:
        rect = patches.Rectangle(
            (bbox["x1"], bbox["y1"]),
            bbox["x2"] - bbox["x1"],
            bbox["y2"] - bbox["y1"],
            linewidth=2,
            edgecolor="r",
            facecolor="none",
        )
        ax.add_patch(rect)
    plt.show()

# Plot every relevant page with its ground-truth bounding boxes.
for qrel in related_qrels:
    plot_bbox(dataset["corpus"][qrel["corpus_id"]]["image"], qrel["bounding_boxes"])

[Figure: example output with bounding boxes plotted on the relevant pages]

Acknowledgements

This work was granted access to the HPC resources of IDRIS (Jean Zay cluster) under the allocation AD011016393 made by GENCI. This project would not have been realized without the commitment of everyone involved; we owe a debt of gratitude to all our annotators and colleagues.

Thanks also to the MTEB team for collaborating with us on the private datasets.

Finally, thank you to those at NVIDIA who contributed to the design and development of this benchmark (Tom Balough, Gabriel Moreira, Bo Liu, Eric Tramel, Mengyao Xu, Radek Osmulski, Erin Potter, and Hannah Brandon) for their invaluable help and advice.


Community

Hi, quick question here. How do you measure the downstream generation quality, since you have a golden-label answer for each query?
If I want to measure generation quality, I am not sure what I can do; I don't quite understand how you labeled the answer, though.


Hey @Chrisyichuan! To give more details on the answer annotations:

  • Answers were annotated by trained humans for each query, sometimes by multiple annotators (these raw annotations are in the raw_answers column).
  • The "final" answer (answer column) was VLM-generated with Qwen2.5-VL-32B: we gave it all the relevant pages and all the answers from annotators, and asked it to merge them. We saw qualitatively that this worked quite well.

(We also translated all answers into the 6 languages of the datasets using a big Qwen3 LLM.)

Hope that makes things clearer for you!
