Build Hallucination-Free RAG with Verbatim
Chill, I Ground!
TL;DR:
- Extract, Don't Generate: LLMs select exact text spans instead of generating -> zero hallucinations
- Wrap Existing RAG: Integrate with LangChain/LlamaIndex in 15 lines of code
- CPU-Only Pipeline: No GPU, no API costs, works offline (SPLADE + ModernBERT) -> lightweight and efficient
- Template Control: Static, dynamic, or question-specific response formatting
- Smart Chunking: Structure-aware with raw/enhanced dual-chunk system
- Production-Ready:
  - Full RAG framework to get you started on your own project
  - Scalable: optional cloud-based compute or local storage
  - Included API and frontend for easy deployment
Extract, Don't Generate
The fundamental issue with RAG systems is that LLMs generate text probabilistically. Even when provided with perfect context, the model samples from a probability distribution over tokens, which inevitably introduces approximations, paraphrasing, and factual drift.
This problem compounds in modern agentic RAG systems. When a single query triggers 7-8 sequential LLM calls (planning, retrieval, synthesis, verification, etc.), each with its own hallucination probability, the cumulative error rate becomes significant. If each call has a 10% chance of introducing an error, an 8-step agentic workflow has roughly a 57% chance (1 - 0.9^8) of containing at least one hallucination.
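The arithmetic behind that number is a quick sanity check (illustrative per-call error rate, not a benchmark):
# Probability that at least one of n independent LLM calls introduces an error,
# assuming a fixed 10% error rate per call.
per_call_error = 0.10
n_calls = 8
p_at_least_one = 1 - (1 - per_call_error) ** n_calls
print(f"{p_at_least_one:.0%}")  # 57%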
Verbatim RAG takes a different approach: Instead of asking the LLM to generate an answer based on retrieved documents, we constrain it to extract exact text spans that answer the question. These spans are then composed into a response without any generative rewriting.
Why this works: Extraction is a classification task (does this span answer the question?), not a generation task. The model never produces new tokens that approximate source content; it only identifies which existing tokens to include. This removes the probabilistic generation step that causes hallucinations.
The result: every number, every fact, every claim in the response is directly traceable to the source text.
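To make the distinction concrete, here is a toy sketch of extraction-as-classification (illustrative only, not the library's implementation; is_relevant stands in for an LLM prompt or a fine-tuned classifier):
# Toy sketch: select existing sentences instead of generating new tokens.
def extract_spans(question: str, chunk: str, is_relevant) -> list[str]:
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    # is_relevant(question, sentence) -> bool; the model only decides keep/drop.
    return [s + "." for s in sentences if is_relevant(question, s)]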
Quickstart: Zero Hallucinations in 15 Lines
Install the required dependencies:
!pip install verbatim-rag
Then create a Verbatim RAG system:
from verbatim_rag import VerbatimRAG, VerbatimIndex
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider
from verbatim_rag.schema import DocumentSchema
# 1. Create index (storage layer) - works on CPU!
store = LocalMilvusStore("./demo.db", enable_sparse=True, enable_dense=False)
embedder = SpladeProvider("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill", device="cpu")
index = VerbatimIndex(vector_store=store, sparse_provider=embedder)
# 2. Add a document
doc = DocumentSchema(
content="""
# Methods
We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.
# Results
The system achieved 42.01% accuracy on ArchEHR-QA.
""",
title="Research Paper"
)
index.add_documents([doc])
# 3. Create RAG with hallucination prevention
rag = VerbatimRAG(index, model="gpt-4o-mini")
# 4. Get exact quotes, not hallucinations
response = rag.query("What approaches were used?")
print(response.answer)
# [1] We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.
Note: Use dense embeddings (SentenceTransformers, OpenAI) or hybrid search by swapping the embedding provider. See docs for details.
Already Have a RAG System? Wrap It in 15 Lines
A key integration pattern: wrap your existing LangChain/LlamaIndex retrieval with Verbatim's extraction layer. Your indexing pipeline stays unchanged; you only add the verbatim extraction step on top of your existing retrieval.
Install the required dependencies:
!pip install langchain langchain-openai langchain-community openai faiss-cpu langchain-text-splitters
Then wrap your existing LangChain RAG system:
from verbatim_rag.providers import RAGProvider
from verbatim_rag import verbatim_query
# Your existing LangChain setup
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from typing import List, Dict, Any, Optional
# 1. Create sample documents
docs = [
Document(
page_content="""
# Methods
We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.
# Results
The system achieved 42.01% accuracy on ArchEHR-QA.
""",
metadata={"title": "Research Paper 1", "source": "paper1.pdf", "year": 2025},
)
]
# 2. Index with LangChain (your existing setup)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
splits = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 3. Wrap your LangChain retriever
class LangChainRAGProvider(RAGProvider):
    def __init__(self, langchain_retriever):
        self.retriever = langchain_retriever

    def retrieve(
        self, question: str, k: int = 5, filter: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        # Use LangChain's retrieval
        docs = self.retriever.invoke(question)
        # Convert to Verbatim format
        context = []
        for doc in docs[:k]:
            context.append(
                {
                    "content": doc.page_content,
                    "title": doc.metadata.get("title", ""),
                    "source": doc.metadata.get("source", ""),
                    "metadata": doc.metadata,
                }
            )
        return context
# 4. Use Verbatim with your existing LangChain RAG
provider = LangChainRAGProvider(retriever)
response = verbatim_query(provider, "What methods were used?", k=5)
print(response.answer)
# Output: [1] We used two approaches: zero-shot LLM extraction and a fine-tuned ModernBERT classifier on 58k synthetic examples.
This approach preserves your existing retrieval system (embeddings, vector store, chunking strategy) while adding verbatim extraction. The only change is at the final answer generation step.
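The same pattern works for LlamaIndex. A sketch mirroring the LangChain provider above, assuming a standard VectorStoreIndex retriever (field names such as "file_name" are assumptions; adjust to your node metadata):
from typing import Any, Dict, List, Optional
from verbatim_rag.providers import RAGProvider
from verbatim_rag import verbatim_query

class LlamaIndexRAGProvider(RAGProvider):
    def __init__(self, llamaindex_retriever):
        self.retriever = llamaindex_retriever

    def retrieve(
        self, question: str, k: int = 5, filter: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        # LlamaIndex returns NodeWithScore objects
        nodes = self.retriever.retrieve(question)
        return [
            {
                "content": node.node.get_content(),
                "title": node.node.metadata.get("title", ""),
                "source": node.node.metadata.get("file_name", ""),
                "metadata": node.node.metadata,
            }
            for node in nodes[:k]
        ]

# retriever = VectorStoreIndex.from_documents(docs).as_retriever(similarity_top_k=5)
# response = verbatim_query(LlamaIndexRAGProvider(retriever), "What methods were used?", k=5)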
Why This Matters
Even with perfect retrieval, the generation step introduces hallucinations. Consider this example:
| | Traditional RAG | Verbatim RAG |
|---|---|---|
| Question | "How much synthetic data?" | "How much synthetic data?" |
| Retrieved | "We created 60k examples" | "We created 60k examples" |
| Generated | "Around 58,000 examples" | "[1] We created 60k examples" |
The LLM rewrote "60k" as "58,000" during token generation, not because retrieval failed, but because the model samples from a learned distribution over approximate phrasings.
This happens because:
- Token-level generation doesn't preserve exact numerals from context
- The token prediction objective learns probability distributions over paraphrases, not a verbatim copying mechanism
- Autoregressive sampling can drift from source text even with high context attention
Traditional RAG systems try to mitigate this through better prompting, retrieval, or reranking. But these don't address the root cause: the model is generating new tokens rather than selecting existing ones.
# Traditional RAG: Model generates tokens (probabilistic)
answer = "They generated approximately 58,000 synthetic training samples"
# Verbatim RAG: Model selects spans (deterministic given the extraction)
answer = "[1] We created 60k examples."
Architecture: Two Independent Layers
Verbatim RAG decouples retrieval from extraction through a clean interface boundary:
Layer 1: VerbatimIndex (Storage)
Handles document ingestion, chunking, embedding, and retrieval:
- Input: Raw documents (PDFs, text, markdown)
- Output: Top-k relevant chunks for a query
- Components: Pluggable chunkers (markdown-aware, Chonkie), embedders (sparse/dense), vector stores (local/cloud Milvus)
Layer 2: Verbatim Core (Extraction)
Consumes retrieved chunks and produces verbatim answers:
- Input: Query + retrieved chunks
- Output: Answer composed entirely from extracted spans
- Components: Span extractors (LLM or fine-tuned models), template managers, verification
The Interface
# Layer 1: Retrieval
chunks = index.query(question, k=5) # Returns List[SearchResult]
# Layer 2: Extraction
spans = extractor.extract_spans(question, chunks) # Returns Dict[str, List[str]]
answer = template_manager.compose(spans) # Returns formatted response
This separation means you can:
- Swap Layer 1 for your existing retrieval (LangChain, custom)
- Use just Layer 2 to harden an existing RAG system
- Replace components independently (e.g., switch from LLM to ModernBERT extractor without touching retrieval)
Index Real Research Papers
from verbatim_rag.schema import DocumentSchema
# Add multiple papers with metadata
papers = [
DocumentSchema.from_url(
url="https://aclanthology.org/L16-1417.pdf",
title="Building Concept Graphs from Monolingual Dictionary Entries",
doc_type="academic_paper",
authors=["Gabor Recski"],
conference="LREC",
year=2016,
),
DocumentSchema.from_url(
url="https://aclanthology.org/2025.bionlp-share.8.pdf",
title="KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering",
doc_type="academic_paper",
authors=["Adam Kovacs", "Paul Schmitt", "Gabor Recski"],
conference="BioNLP",
year=2025,
),
]
index.add_documents(papers)
Search with metadata filtering:
# Find papers from BioNLP
results = index.query(
filter='metadata["conference"] == "BioNLP"',
k=10
)
for result in results:
    print(f"{result.metadata['title']} ({result.metadata['year']})")
    print(f"Score: {result.score:.3f}")
For production, use cloud storage:
from verbatim_rag.vector_stores import CloudMilvusStore
store = CloudMilvusStore(
uri="https://your-milvus-instance.com",
token="your-token",
collection_name="production"
)
CPU-Only Pipeline: No LLM API Calls
The entire pipeline can run on CPU without GPU or LLM API calls:
from verbatim_rag.extractors import ModelSpanExtractor
from verbatim_rag.embedding_providers import SpladeProvider
# SPLADE for embeddings (sparse representations are CPU-efficient)
embedder = SpladeProvider("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill", device="cpu")
# ModernBERT for extraction (fine-tuned on span classification)
extractor = ModelSpanExtractor(
model_path="KRLabsOrg/verbatim-rag-modern-bert-v1",
device="cpu"
)
# Complete CPU-only RAG
rag = VerbatimRAG(index, extractor=extractor)
Tradeoffs: The fine-tuned extractor has lower recall than LLM-based extractors on complex queries, but is deterministic, free, and works offline.
Template Management
Template management controls how extracted spans are composed into final responses. Three modes are available:
1. Static Mode - Fixed Format
Use when response formatting must be identical across all queries:
template = """
Based on our research documents:
[RELEVANT_SENTENCES]
Need more information? Contact [email protected]
"""
rag.template_manager.use_static_mode(template)
response = rag.query("What methods were used?")
# Output:
# Based on our research documents:
# [1] We used zero-shot LLM extraction and fine-tuned ModernBERT.
#
# Need more information? Contact [email protected]
The [RELEVANT_SENTENCES] placeholder is automatically replaced with numbered citations.
Use cases: Compliance-required formatting, API responses with fixed schemas, systems where output format must be predictable.
2. Dynamic Mode - Adaptive Formatting (Default)
The LLM generates a context-appropriate template for each query:
rag.template_manager.use_contextual_mode() # Default
response = rag.query("Who are the authors?")
# Gets: "The authors are: [1] ..."
response = rag.query("What were the results?")
# Gets: "The system achieved: [1] ..."
Use cases: Conversational interfaces, chatbots, systems where natural language flow matters more than format consistency.
3. Question-Specific Mode - Template Matching
Sometimes you want different response formats for different question types, without the variability of dynamic mode. Define templates paired with example questions, and the system automatically selects the best-fitting template using semantic similarity:
templates = [
{
"template": "**Methodology:**\n\n[RELEVANT_SENTENCES]\n\n*Methods extracted from source*",
"examples": [
"What methods were used?",
"How was this done?",
"Describe the methodology"
]
},
{
"template": "**Key Findings:**\n\n[RELEVANT_SENTENCES]\n\n*Results from source*",
"examples": [
"What were the results?",
"What did they find?",
"What are the key findings?"
]
}
]
rag.template_manager.use_question_specific_mode(templates)
# Questions automatically matched to appropriate templates
response = rag.query("How did they approach this?")
# Uses Methodology template
response = rag.query("What were the main findings?")
# Uses Key Findings template
The system embeds all example questions once and matches incoming queries using cosine similarity—no LLM call needed.
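Conceptually this is plain embedding similarity. A minimal sketch of the matching step (not the library's exact implementation), assuming sentence-transformers with the all-MiniLM-L6-v2 model:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed every example question once, remembering which template owns it
example_texts = [ex for t in templates for ex in t["examples"]]
example_owners = [t for t in templates for _ in t["examples"]]
example_embs = model.encode(example_texts, convert_to_tensor=True)

def pick_template(question: str) -> str:
    # One encode + cosine similarity per query, no LLM call
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, example_embs)[0]
    return example_owners[int(scores.argmax())]["template"]

print(pick_template("How did they approach this?"))  # -> Methodology template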
Developer Tools: Inspection & Debugging
Verbatim RAG includes built-in tools for understanding and debugging your index:
Inspect Your Index
# Get overview statistics
stats = index.inspect()
print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")
print(f"Document types: {stats['doc_types']}")
Examine Chunks Before Deployment
# Get first 5 chunks to see what was indexed
chunks = index.get_all_chunks(limit=5)
for chunk in chunks:
    print(f"Raw text: {chunk.text[:100]}...")
    print(f"Enhanced (embedded): {chunk.enhanced_text[:100]}...")
    print(f"Metadata: {chunk.metadata}")
Powerful Metadata Filtering
Add ANY custom fields to documents and filter at query time:
# Add documents with custom metadata
doc = DocumentSchema.from_url(
url="https://aclanthology.org/2025.bionlp-share.8.pdf",
year=2025,
conference="BioNLP",
category="medical-nlp",
user_id="user123", # Any custom field!
project_id="proj_A" # Works with any field
)
# Combine semantic search + complex filters
results = index.query(
text="clinical question answering",
filter='metadata["year"] >= 2024 && metadata["category"] == "medical-nlp"',
k=10
)
Additional Features
Bring Your Own LLM
Works with any OpenAI-compatible endpoint:
from verbatim_rag.core import LLMClient
# Use Claude, local Ollama, Azure OpenAI, etc.
llm_client = LLMClient(
model="moonshotai/kimi-k2-instruct-0905",
api_base="https://api.groq.com/openai/v1/"
)
rag = VerbatimRAG(index, llm_client=llm_client)
PDF Processing with Docling
Automatically convert PDFs to structured markdown:
doc = DocumentSchema.from_url(
url="https://aclanthology.org/2025.bionlp-share.8.pdf",
title="Research Paper"
)
# Docling handles: PDF → Markdown with preserved structure
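If you want to inspect the markdown before indexing, you can run the conversion yourself; a minimal sketch assuming Docling's DocumentConverter API:
from docling.document_converter import DocumentConverter
from verbatim_rag.schema import DocumentSchema

converter = DocumentConverter()
result = converter.convert("https://aclanthology.org/2025.bionlp-share.8.pdf")
markdown = result.document.export_to_markdown()

# Index the converted markdown like any other document
doc = DocumentSchema(content=markdown, title="Research Paper")
index.add_documents([doc])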
Technical Tradeoffs
When to use Verbatim RAG:
Verbatim RAG trades generation flexibility for factual precision. Use it when:
- Exactness matters: Medical dosages, legal citations, financial figures must be character-perfect
- Auditability required: Every claim needs to trace back to source text
- Liability concerns: Approximate or paraphrased answers create legal/compliance risk
- Existing RAG needs hardening: You have a working system but need to eliminate hallucinations
When traditional RAG may be preferable:
- Synthesis across sources: Answering requires combining information from multiple passages in ways that don't exist verbatim in any single source
- Natural language fluency: Responses need to be conversational and flowing, not citation-heavy
- Abstract/analytical queries: Questions like "What are the main themes?" require interpretation, not extraction
Hybrid approach: Use Verbatim RAG for factual queries and traditional RAG for analytical/synthesis queries, routing based on query type.
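A minimal routing sketch (the keyword heuristic is a placeholder for a real query classifier, and the generative RAG object is hypothetical):
# Keyword heuristic as a stand-in for a real query classifier
ANALYTICAL_MARKERS = ("why", "compare", "themes", "summarize", "implications")

def route_query(question: str, verbatim_system, generative_system):
    """Send analytical questions to a generative RAG, everything else to Verbatim RAG."""
    if any(marker in question.lower() for marker in ANALYTICAL_MARKERS):
        return generative_system.query(question)  # free-form synthesis
    return verbatim_system.query(question)        # exact-span extraction

# route_query("What dosage was reported?", rag, my_generative_rag)    # -> Verbatim RAG
# route_query("Compare the two approaches.", rag, my_generative_rag)  # -> generative RAG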
Resources
- Tutorial Notebook - Complete guide with examples
- Documentation - API reference and advanced usage
- HuggingFace Models - Fine-tuned extractors
- GitHub Discussions - Questions and use cases
- Report Issues
- KR Labs - Enterprise support
Research & Citation
Verbatim RAG is based on research published at ACL BioNLP 2025.
Read the paper: ACL Anthology 2025
If you use Verbatim RAG in your research:
@inproceedings{kovacs-etal-2025-kr,
title = "{KR} Labs at {A}rch{EHR}-{QA} 2025: A Verbatim Approach for Evidence-Based Question Answering",
author = "Kovacs, Adam and Schmitt, Paul and Recski, Gabor",
booktitle = "Proceedings of the 24th Workshop on Biomedical Language Processing",
year = "2025",
url = "https://aclanthology.org/2025.bionlp-share.8/",
pages = "69--74"
}