ViDoRe V3: a comprehensive evaluation of retrieval for enterprise use-cases

Community Article · Published November 5, 2025

TL;DR

ILLUIN Technology is proud to release ViDoRe V3, our latest benchmark, designed and developed with contributions from NVIDIA and engineered to set a new industry gold standard for multi-modal, enterprise document retrieval evaluation. It addresses a critical challenge in production RAG systems: retrieving accurate information from complex, visually rich documents.

ViDoRe V3 improves on existing RAG benchmarks by prioritizing enterprise relevance and rigorous data quality. Instead of relying on clean academic texts, the benchmark draws from 10 challenging, real-world datasets spanning diverse industrial domains, with 8 publicly released and 2 kept private. In addition, while previous benchmarks often rely on synthetically generated data, ViDoRe V3 features human-created and human-verified annotations.

This benchmark contains 26,000 pages and 3,099 queries translated into 6 languages. Each query is linked to retrieval ground truth data created and verified by human annotators: relevant pages, precise bounding box annotations for key elements, and a comprehensive reference answer.

[Figure: example query with its human-annotated relevant pages, bounding boxes, and reference answer]

Why we built ViDoRe V3

The document retrieval landscape is more diverse than ever. New pipelines based on Vision-Language Models (VLMs) are challenging traditional systems built on text retrieval and generation models. With ViDoRe V1 and V2, we took the first steps towards better evaluating VLM retrievers:

  • ViDoRe V1 focused on extractive queries based on single pages,
  • ViDoRe V2 extended the benchmark to more open-ended queries.

But the corpora were still small compared to real-world use cases, and both versions relied heavily on synthetic generation. While these were steps in the right direction, previous benchmarks still left us with a fragmented picture: corpora needed to be larger and more representative of enterprise data, queries needed to be more diverse, end-to-end evaluation was difficult, and human verification was missing.

Our core contributions

To address the limitations of previous benchmarks, we focused on 3 main improvement pillars:

  1. Enterprise-relevant Corpora: We assembled 10 diverse corpora, each focusing on a distinct, enterprise-relevant domain or task. For each domain, we curated 1,000+ pages from diverse, permissively licensed multimodal documents that mirror real-world enterprise retrieval challenges and complexity. 8 datasets are publicly released and 2 are private to prevent overfitting.
  2. Human-verified annotations: For each query, we provide human-annotated page-relevancy rankings, bounding boxes, and written answers to enable comprehensive retrieval/RAG evaluation (see the illustrative record after this list).
  3. Diverse queries: To systematically identify failure modes, queries span 7 types (e.g., multi-hop, numerical) and 3 formats (question, instruction, keyword). To assess cross-language capabilities, all queries are provided in 6 languages: English, French, Spanish, German, Italian, and Portuguese.
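
To make the annotation format concrete, here is an illustrative record for a single query. The field names mirror those used in the usage snippets further down in this post (query, answer, query_id, corpus_id, bounding_boxes); the values themselves are invented, and the exact schema may differ slightly:

# Illustrative only: one query (queries split) and one of its relevance
# judgments (qrels split); values are made up for the example.
example_query = {
    "query_id": "industrial_0042",
    "query": "What type of airflow is required to maintain ultra-clean environments?",
    "answer": "Laminar airflow is required to maintain ultra-clean environments.",
}

example_qrel = {
    "query_id": "industrial_0042",
    "corpus_id": 1337,  # index of the relevant page in the corpus split
    "bounding_boxes": [
        {"x1": 120, "y1": 340, "x2": 880, "y2": 520},  # region holding the evidence
    ],
}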

Public datasets

| Dataset | Domain | Corpus Language | Main Modalities | # Pages | # Queries (without translations) |
|---|---|---|---|---|---|
| French Public Company Annual Reports | Finance-FR | French | Text, Table, Charts | 2384 | 320 |
| U.S. Public Company Annual Reports | Finance-EN | English | Text, Table | 2942 | 309 |
| Computer Science Textbooks | Computer Science | English | Text, Infographic, Tables | 1360 | 215 |
| HR Reports from EU | HR | English | Text, Table, Charts | 1110 | 318 |
| French Governmental Energy Reports | Energy | French | Text, Charts | 2229 | 308 |
| USAF Technical Orders | Industrial | English | Text, Tables, Infographics, Images | 5244 | 283 |
| FDA Reports | Pharmaceuticals | English | Text, Charts, Images, Infographic, Tables | 2313 | 364 |
| French Physics Lectures | Physics | French | Text, Images, Infographics | 1674 | 302 |

Private datasets

Two datasets will remain private and will be managed by the MTEB team (big thanks to them!) to ensure benchmark integrity and mitigate overfitting. Evaluating on held-out data should give a less biased assessment of visual retriever models and a more representative picture of their true capabilities. To avoid revealing too much detail about these datasets, we only disclose the domain and the language of the documents.

The two private datasets cover:

  1. Energy-related regulatory documents (English)
  2. Telecom-related technical standard documents (English)

Query categories

[Figure: distribution of query types across the datasets]

We designed the ViDoRe V3 queries to reflect the diversity and complexity of real-world retrieval tasks. Each query is formatted as a question, instruction, or keyword and tagged with one or more of 7 query types.

| Query Type | Definition |
|---|---|
| Open-Ended | A query requiring synthesis and explanation of information. The answer must integrate multiple concepts into a coherent narrative rather than citing a single fact. |
| Compare-Contrast | A query requiring identification and articulation of similarities and/or differences between two or more entities, concepts, or topics. |
| Enumerative | A query requesting a complete list of items that meet specific criteria. |
| Numerical | A query expecting a numerical value, obtained either by direct extraction or calculation. |
| Boolean | A query expecting a yes/no answer, potentially requiring reasoning over extracted information. |
| Extractive | A query answerable by directly citing a specific fact or piece of information from the documents. |
| Multi-hop | A query requiring information retrieval from multiple distinct sources or sections, which must then be combined to produce a complete answer. |

To illustrate which combinations of query types tend to occur together and how common each one is, we visualize the distribution and cardinality of all combinations. Single-category queries are most common, but many queries blend multiple types, such as extractive questions requiring numerical comparisons.

[Figure: UpSet plot of query type combinations and their frequencies]

We paid particular attention to ensuring queries are challenging for current retrieval systems across all domains. Most queries require information spread across multiple pages, forcing models to extract and synthesize content from entire documents rather than relying on single-page matches.

[Figure: ridge plot of the number of annotated pages per query]

A hybrid generation process: how to build a challenging benchmark

[Figure: query generation and annotation pipeline]

To build a benchmark that is robust, difficult, and high-quality, we developed a hybrid process that balances human expertise with LLM-driven scaling. Our goal was to create realistic queries, so we began with a page-agnostic approach: as in ViDoRe V2, queries were generated from high-level summaries of document sections rather than from a single page. This prevents the tasks from being overly simple and ensures they mimic real-world user intent. Generation was carried out both through synthetic pipelines (including NVIDIA NeMo Data Designer with Qwen3-235B) for scale and by expert human annotators for nuance and complexity.
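
As a rough illustration of the page-agnostic idea (not the actual NeMo Data Designer pipeline), the sketch below assumes an OpenAI-compatible endpoint serving a Qwen3 model; the endpoint URL, model identifier, and prompt are placeholders:

from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; the real pipeline combined
# NVIDIA NeMo Data Designer (with Qwen3-235B) and expert human annotators.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def draft_queries(section_summary: str, n_queries: int = 3) -> str:
    """Draft retrieval queries from a high-level section summary instead of a single page."""
    prompt = (
        "You are helping build a document-retrieval benchmark.\n"
        f"Section summary:\n{section_summary}\n\n"
        f"Write {n_queries} realistic user queries answerable from this section. "
        "Mix formats (question, instruction, keyword) and avoid copying sentences verbatim."
    )
    response = client.chat.completions.create(
        model="qwen3-235b",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content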

With thousands of generated queries over thousands of corpus pages, finding the correct ground-truth answers required a massive annotation effort. We implemented a multi-stage funnel to scale this process. First, a VLM (Qwen2.5-VL-32B) performed a high-recall loose filter to rapidly discard clearly irrelevant pages, limiting false negatives and focusing the annotators' effort. Following this pre-filtering, trained human annotators performed the critical work: they identified the truly relevant pages and produced the final, detailed annotations, including page-level relevancy rankings, precise written answers, and ground-truth bounding boxes.
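
A minimal sketch of what such a high-recall loose filter can look like, again assuming an OpenAI-compatible endpoint, this time serving a vision model; the prompt and decision rule are illustrative, not the production ones:

import base64
import io

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

def maybe_relevant(page_image, query: str) -> bool:
    """Loose filter: keep the page unless the VLM is confident it is irrelevant."""
    buffer = io.BytesIO()
    page_image.save(buffer, format="PNG")  # PIL image from the corpus split
    encoded = base64.b64encode(buffer.getvalue()).decode()
    response = client.chat.completions.create(
        model="qwen2.5-vl-32b",  # placeholder model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
                {"type": "text", "text": f"Could this page help answer: '{query}'? Reply YES or NO."},
            ],
        }],
    )
    # Err on the side of keeping pages: false negatives are what we want to avoid.
    return not response.choices[0].message.content.strip().upper().startswith("NO")

# Pages that survive the filter go to trained human annotators for relevancy
# ranking, written answers, and bounding boxes.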

While perfect ground truth is an elusive goal for any dataset at this scale, we invested heavily in a multi-layered quality-control framework. Our annotators had native-level language proficiency and all passed pre-production validation and pilot gates. Key tasks were completed by multiple annotators to ensure consensus, and the data underwent both quality-control and audit checks by experienced senior annotators. This layered approach was designed to make the ground truth and the benchmark tasks as reliable and realistic as possible.

As a final quality assurance step, we rigorously filtered the annotations. This involved checking for annotator consensus, performing manual review, and using Qwen2.5-VL-32B to confirm the presence of relevant information across the annotated pages. We then leveraged Qwen2.5-VL-32B one last time to merge the remaining outputs into one single golden answer.

A hard benchmark for current retrieval models

[Figure: model performance comparison across ViDoRe V1, V2, and V3]

We evaluate a wide range of modern visual retrieval models on our benchmark using the MTEB framework. The results confirm that the benchmark is exceptionally challenging for current methods.

The best-performing models reach roughly 65% NDCG@10 on the English datasets. When multilingual documents and translated queries are introduced, performance degrades significantly, with no model averaging above 60% NDCG@10.

A deeper analysis of the results reveals several key patterns:

  • Challenges with technical documents: Models struggle significantly when faced with the highly technical documents in our Industrial subset and our private Energy-EN set, particularly with interpreting dense schematics and complex graphs.
  • Persistent multilingual challenges: Performance drops considerably on our French documents. For the Physics and Finance-FR splits, no model was able to reach 50% NDCG@10.
  • Relative strength in computer science: Models demonstrate higher performance on the Computer Science split. We hypothesize this is a spill-over effect from the massive amount of coding data used to train modern VLMs, making them more knowledgeable about that domain.

The full, detailed evaluation results and a deeper analysis of the dataset's difficulty are available below. All the metrics reported, unless stated otherwise, are NDCG@10.
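
For readers less familiar with the metric, NDCG@10 rewards placing the most relevant pages near the top of the ranking and normalizes by the best achievable ordering. A minimal, standard (linear-gain) implementation is sketched below; the benchmark's own numbers are computed by MTEB:

import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for a single query.

    ranked_ids: page ids in the order returned by the retriever.
    relevance:  dict mapping page id -> graded relevance (missing ids count as 0).
    """
    # Discounted cumulative gain of the returned ranking.
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal DCG: the best possible ordering of the ground-truth labels.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one highly relevant page retrieved at rank 1, one relevant page missed.
print(ndcg_at_k(["p7", "p2", "p9"], {"p7": 2, "p4": 1}))  # ~0.76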

English evaluation results

| Model | Average | Computer Science EN | Energy-EN | Finance-EN | Pharmaceuticals EN | HR EN | Industrial EN | Telecom EN |
|---|---|---|---|---|---|---|---|---|
| nemo-colembed-3b | 0.656 | 0.778 | 0.534 | 0.695 | 0.669 | 0.649 | 0.570 | 0.694 |
| nemo-colembed-1b | 0.643 | 0.755 | 0.522 | 0.670 | 0.662 | 0.645 | 0.561 | 0.687 |
| jinav4 | 0.639 | 0.742 | 0.524 | 0.661 | 0.652 | 0.646 | 0.559 | 0.687 |
| colnomic-7b | 0.630 | 0.782 | 0.482 | 0.631 | 0.646 | 0.629 | 0.542 | 0.696 |
| colnomic-3b | 0.617 | 0.755 | 0.455 | 0.630 | 0.637 | 0.626 | 0.528 | 0.686 |
| colqwen2.5 | 0.592 | 0.752 | 0.429 | 0.612 | 0.609 | 0.592 | 0.494 | 0.653 |
| nomic-7b (dense) | 0.573 | 0.709 | 0.423 | 0.576 | 0.638 | 0.559 | 0.485 | 0.620 |
| colqwen2 | 0.563 | 0.735 | 0.441 | 0.509 | 0.581 | 0.547 | 0.498 | 0.632 |
| colpali-v1.3 | 0.530 | 0.725 | 0.381 | 0.433 | 0.577 | 0.533 | 0.470 | 0.592 |
| nomic-3b (dense) | 0.517 | 0.621 | 0.372 | 0.533 | 0.592 | 0.519 | 0.411 | 0.572 |
| colmodernvbert | 0.507 | 0.597 | 0.420 | 0.504 | 0.566 | 0.470 | 0.439 | 0.552 |
| colsmol256 | 0.464 | 0.574 | 0.365 | 0.477 | 0.514 | 0.460 | 0.385 | 0.475 |

Multilingual results

| Model | Average | Computer Science EN | Physics FR | Energy-EN | Energy-FR | Finance-EN | Pharmaceuticals EN | HR EN | Industrial EN | Finance-FR | Telecom EN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| jinav4 | 0.576 | 0.718 | 0.466 | 0.500 | 0.640 | 0.593 | 0.631 | 0.595 | 0.504 | 0.461 | 0.648 |
| colnomic-7b | 0.574 | 0.762 | 0.483 | 0.450 | 0.640 | 0.566 | 0.623 | 0.587 | 0.501 | 0.455 | 0.672 |
| nemo-colembed-3b | 0.573 | 0.752 | 0.451 | 0.491 | 0.621 | 0.609 | 0.637 | 0.587 | 0.471 | 0.438 | 0.670 |
| colnomic-3b | 0.558 | 0.727 | 0.475 | 0.421 | 0.650 | 0.563 | 0.611 | 0.573 | 0.474 | 0.443 | 0.645 |
| nemo-colembed-1b | 0.556 | 0.713 | 0.441 | 0.473 | 0.609 | 0.589 | 0.626 | 0.570 | 0.466 | 0.424 | 0.647 |
| colqwen2.5 | 0.519 | 0.723 | 0.459 | 0.381 | 0.597 | 0.523 | 0.579 | 0.512 | 0.413 | 0.391 | 0.613 |
| binomic-7b | 0.490 | 0.666 | 0.442 | 0.367 | 0.575 | 0.488 | 0.589 | 0.462 | 0.379 | 0.360 | 0.578 |
| colqwen2 | 0.447 | 0.686 | 0.416 | 0.357 | 0.488 | 0.390 | 0.522 | 0.451 | 0.383 | 0.200 | 0.574 |
| binomic-3b | 0.443 | 0.585 | 0.420 | 0.322 | 0.514 | 0.442 | 0.553 | 0.433 | 0.332 | 0.289 | 0.537 |
| colpali-v1.3 | 0.431 | 0.653 | 0.417 | 0.329 | 0.471 | 0.344 | 0.531 | 0.448 | 0.356 | 0.218 | 0.540 |
| colmodernvbert | 0.245 | 0.353 | 0.212 | 0.196 | 0.305 | 0.270 | 0.317 | 0.183 | 0.144 | 0.179 | 0.293 |
| colsmol256 | 0.214 | 0.288 | 0.161 | 0.183 | 0.248 | 0.232 | 0.278 | 0.165 | 0.129 | 0.157 | 0.298 |

Query type difficulty analysis

We break down the score distribution by query type and task for nemo-retriever-colembed-3b. Model performance aligns well with the expected difficulty of each query type: open-ended (NDCG@10 = 0.438) and multi-hop queries (0.515) are the hardest to retrieve, while extractive (0.668) and boolean (0.657) are the easiest.

| Query Type | Average | Computer Science EN | Physics FR | Energy-FR | Finance-EN | Pharmaceuticals EN | HR EN | Industrial EN | Finance-FR |
|---|---|---|---|---|---|---|---|---|---|
| Extractive | 0.668 | 0.777 | 0.526 | 0.767 | 0.661 | 0.744 | 0.723 | 0.663 | 0.547 |
| Boolean | 0.657 | 0.825 | 0.501 | 0.741 | 0.729 | 0.747 | 0.547 | 0.626 | 0.410 |
| Numerical | 0.633 | 0.712 | 0.596 | 0.725 | 0.587 | 0.832 | 0.647 | 0.703 | 0.488 |
| Compare-Contrast | 0.590 | 0.799 | 0.581 | 0.694 | 0.466 | 0.669 | 0.552 | 0.478 | 0.490 |
| Enumerative | 0.546 | 0.712 | 0.307 | 0.549 | 0.675 | 0.667 | 0.562 | 0.347 | 0.397 |
| Multi-hop | 0.515 | 0.710 | 0.415 | 0.359 | 0.597 | 0.701 | 0.603 | 0.446 | 0.183 |
| Open-ended | 0.438 | 0.709 | 0.375 | 0.475 | 0.529 | 0.489 | 0.498 | 0.209 | 0.324 |
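
One simple way to produce such a breakdown, assuming you have per-query NDCG@10 scores and each query's type tags (queries can carry several), is sketched below; the exact aggregation used for the table above may differ:

from collections import defaultdict

def scores_by_query_type(per_query_scores, query_types):
    """Average a per-query metric over query-type tags.

    per_query_scores: dict query_id -> NDCG@10 for that query.
    query_types:      dict query_id -> list of type tags for that query.
    """
    buckets = defaultdict(list)
    for query_id, score in per_query_scores.items():
        for tag in query_types.get(query_id, []):
            buckets[tag].append(score)  # multi-tagged queries count in every bucket
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}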

Benchmark scope

Coverage of enterprise documents: A key challenge in developing this benchmark was the limited availability of in-domain multi-modal documents. While significant effort was made to curate relevant documents, the corpora may not fully represent proprietary enterprise data in all contexts.

Language coverage: The benchmark is currently limited to French and English documents. While we attempted to curate relevant documents in additional languages, resource constraints prevented broader language coverage. To mitigate this limitation, queries were translated into multiple languages to enable evaluation of cross-lingual tasks.

Annotation quality: Achieving perfect annotation quality is challenging at this scale and level of task complexity. We implemented a multi-layered quality-control framework incorporating both advanced LLMs/VLMs and senior human annotators throughout the pipeline to validate quality and minimize Type I and Type II errors. Despite this rigorous validation process, some annotation errors may remain in the benchmark.

Usage

Evaluation

Here is a quick script showing how to evaluate colqwen2.5-v0.2 on the new benchmark using MTEB (at the time of writing, the benchmark is not yet merged into MTEB's main branch):

import mteb

# Load the ViDoRe V3 benchmark tasks and the retriever to evaluate.
benchmark = mteb.get_benchmark("ViDoRe(v3)")
model = mteb.get_model("vidore/colqwen2.5-v0.2")

# Run the evaluation across all ViDoRe V3 tasks.
results = mteb.evaluate(model=model, tasks=benchmark)

Sample visualization

Here is a simple script to visualize a query/answer pair, with bounding boxes plotted on the relevant pages.

from datasets import load_dataset
import matplotlib.pyplot as plt
import matplotlib.patches as patches

dataset_name = "vidore/vidore_v3_industrial"

# Load the three splits: queries, query-to-page relevance judgments (qrels), and page images.
dataset = {
    "queries": load_dataset(dataset_name, data_dir="queries", split="test"),
    "qrels": load_dataset(dataset_name, data_dir="qrels", split="test"),
    "corpus": load_dataset(dataset_name, data_dir="corpus", split="test"),
}

# Pick a sample query and print its reference answer.
query_sample = dataset["queries"][8]
print("Query:", query_sample["query"])
print("Answer:", query_sample["answer"])
> Query: What type of airflow is required to maintain ultra-clean environments in aerospace operations?
> Answer: Laminar airflow is required to maintain ultra-clean environments in aerospace operations.

# Keep only the relevance judgments for this query.
related_qrels = dataset["qrels"].filter(lambda x: x["query_id"] == query_sample["query_id"])

def plot_bbox(image, bboxes):
    """Display a page image with its annotated bounding boxes."""
    _, ax = plt.subplots(figsize=(18, 12))
    ax.imshow(image)
    ax.axis("off")
    for bbox in bboxes:
        rect = patches.Rectangle(
            (bbox["x1"], bbox["y1"]),
            bbox["x2"] - bbox["x1"],
            bbox["y2"] - bbox["y1"],
            linewidth=2,
            edgecolor="r",
            facecolor="none",
        )
        ax.add_patch(rect)
    plt.show()

# Plot every relevant page with its ground-truth bounding boxes.
for qrel in related_qrels:
    plot_bbox(dataset["corpus"][qrel["corpus_id"]]["image"], qrel["bounding_boxes"])

[Figure: example output with bounding boxes plotted on the relevant pages]

Acknowledgements

This work was granted access to the HPC resources of IDRIS (Jean Zay cluster) under the allocation AD011016393 made by GENCI. This project would not have been realized without the commitment of everyone involved; we owe a debt of gratitude to all our annotators and colleagues.

Thanks also to the MTEB team for collaborating with us on the private datasets.

Finally, thank you to those at NVIDIA who contributed to the design and development of this benchmark (Tom Balough, Gabriel Moreira, Bo Liu, Eric Tramel, Mengyao Xu, Radek Osmulski, Erin Potter, and Hannah Brandon) for their invaluable help and advice.


Community

Hi, quick question here. How do you measure the downstream generation quality, since you have a golden-label answer for each query?
If I want to measure generation quality, I am not sure what I can do; I don't quite understand how you labeled the answer, though.


Hey @Chrisyichuan! To give more details on the answer annotations:

  • Answers were annotated by trained humans for each query, sometimes by multiple annotators (these raw annotations are in the raw_answers column).
  • The "final" answer (answer column) was VLM-generated with Qwen2.5-VL-32B: we gave it all the relevant pages and all the answers from annotators, and asked it to merge them. We saw qualitatively that this worked quite well.

(We also translated all answers into the 6 languages of the datasets using a big Qwen3 LLM.)

Hope that makes things clearer for you!
