
VedaMD Document Pipeline Guide

Complete guide for adding and managing medical documents in VedaMD


Table of Contents

  1. Overview
  2. Quick Start
  3. Building Vector Store from Scratch
  4. Adding Single Documents
  5. Updating Existing Documents
  6. Uploading to Hugging Face
  7. Advanced Usage
  8. Troubleshooting

Overview

What is the Pipeline?

The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.

Before Pipeline (Manual Process):

PDF → Extract Text → Chunk → Embed → Build FAISS → Upload to HF
  ↓        ↓          ↓        ↓         ↓            ↓
Hours    Manual     Script   Script   External    Manual
         Work       Needed   Needed    Tool       Upload

With Pipeline (Automated):

PDF → python add_document.py file.pdf → Done ✅
  ↓
Minutes

Pipeline Components

  1. build_vector_store.py - Build complete vector store from directory of PDFs
  2. add_document.py - Add single documents to existing vector store
  3. Automatic Features:
    • PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
    • Smart medical chunking
    • Duplicate detection
    • Quality validation
    • HF Hub integration
    • Automatic backups
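
Conceptually, these components boil down to extract → chunk → embed → index, repeated per PDF. A stdlib-only sketch of that flow with stub extract/embed steps (function names are illustrative; the real implementations in scripts/build_vector_store.py use PyMuPDF, sentence-transformers, and FAISS):

```python
def extract(pdf_path: str) -> str:
    return f"(text of {pdf_path})" * 50          # stub for PDF extraction

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # Sliding window with overlap, mirroring the default chunking parameters.
    out, start = [], 0
    while start < len(text):
        out.append(text[start:start + size])
        start += size - overlap
    return out

def embed(chunks: list[str]) -> list[list[float]]:
    return [[float(len(c))] for c in chunks]      # stub for the embedding model

def build(pdf_paths: list[str]) -> list[list[float]]:
    # One pass over all PDFs, accumulating vectors for the index.
    vectors = []
    for path in pdf_paths:
        vectors.extend(embed(chunk(extract(path))))
    return vectors

print(len(build(["Obs/Breech.pdf", "Obs/RhESUS.pdf"])))
```

The real scripts add the surrounding machinery (duplicate checks, backups, HF upload) on top of this loop.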

Quick Start

Prerequisites

All required packages are already installed in your .venv:

  • ✅ PyMuPDF (PDF extraction)
  • ✅ pdfplumber (backup PDF extraction)
  • ✅ sentence-transformers (embeddings)
  • ✅ faiss-cpu (vector indexing)
  • ✅ huggingface-hub (uploading)

30-Second Test

# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate

# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

# That's it! ✅

Building Vector Store from Scratch

Basic Usage

Build a vector store from all PDFs in a directory:

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

Expected output:

🚀 STARTING VECTOR STORE BUILD

🔍 Scanning for PDFs in Obs
✅ Found 15 PDF files
  📄 Breech.pdf
  📄 RhESUS.pdf
  ... (13 more)

============================================================
📄 Processing: Breech.pdf

📄 Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
📝 Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
🧮 Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added

... (processes all PDFs)

============================================================
✅ BUILD COMPLETE!

📊 Summary:
  • PDFs processed: 15
  • Total chunks: 247
  • Embedding dimension: 384
  • Output directory: ./data/vector_store
  • Build time: 45.23 seconds

Customizing Chunk Size

For longer/shorter chunks:

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --chunk-size 1500 \
  --chunk-overlap 150

Recommendations:

  • chunk-size: 800-1200 (default: 1000)
  • chunk-overlap: 50-200 (default: 100)
  • Smaller chunks = more precise retrieval
  • Larger chunks = better context

Using Different Embedding Model

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --embedding-model "sentence-transformers/all-mpnet-base-v2"

Available models:

  • all-MiniLM-L6-v2 (default) - Fast, 384d, good quality
  • all-mpnet-base-v2 - Better quality, 768d, slower
  • multi-qa-mpnet-base-dot-v1 - Optimized for Q&A

Build and Upload to HF

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

Note: Requires HF_TOKEN environment variable or --hf-token argument


Adding Single Documents

Basic Usage

Add a new guideline to existing vector store:

python scripts/add_document.py \
  --file ./new_guideline.pdf \
  --citation "SLCOG Hypertension Guidelines 2025" \
  --category "Obstetrics" \
  --vector-store-dir ./data/vector_store

Expected output:

📄 Adding document: new_guideline.pdf

📄 Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
🔑 File hash: a3f2c9d8e1b0...
🔍 Checking for duplicates...
✅ No duplicates found
📝 Created 14 chunks
🧮 Generating embeddings...
📊 Adding to FAISS index...
✅ Added 14 chunks to vector store
📊 New total: 261 vectors

============================================================
💾 Saving updated vector store...

📦 Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config

============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!

📊 Summary:
  • Chunks added: 14
  • Total vectors: 261
  • Time taken: 8.43 seconds

Add and Upload to HF

python scripts/add_document.py \
  --file ./new_guideline.pdf \
  --citation "WHO Guidelines 2025" \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

Allow Duplicates

By default, duplicate detection is enabled. To force add:

python scripts/add_document.py \
  --file ./updated_guideline.pdf \
  --vector-store-dir ./data/vector_store \
  --no-duplicate-check
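
The duplicate check appears to key on a content hash of the file (note the "File hash" line in the output above). A stdlib sketch of that idea, assuming the hashes of previously added files are kept in a set:

```python
import hashlib

def file_hash(path: str) -> str:
    # Hash the raw bytes in blocks so large PDFs never need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def is_duplicate(path: str, known_hashes: set[str]) -> bool:
    return file_hash(path) in known_hashes
```

Because the hash covers file content rather than the file name, renaming a PDF will not evade the check, but any edit to the file produces a new hash.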

Updating Existing Documents

To update an existing guideline:

  1. Add new version (recommended):
python scripts/add_document.py \
  --file ./guidelines_v2.pdf \
  --citation "SLCOG Hypertension Guidelines 2025 v2" \
  --vector-store-dir ./data/vector_store
  2. Rebuild from scratch (if major changes):
# Move old PDFs to archive
mkdir -p Obs/archive
mv Obs/old_guideline.pdf Obs/archive/

# Add new version
cp ~/Downloads/new_guideline.pdf Obs/

# Rebuild
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

Uploading to Hugging Face

Setup HF Token

# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"

# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...

Initial Upload

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

Incremental Upload

After adding a document:

python scripts/add_document.py \
  --file ./new.pdf \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

What Gets Uploaded

  • ✅ faiss_index.bin - FAISS vector index
  • ✅ documents.json - Document chunks
  • ✅ metadata.json - Citations, sources, sections
  • ✅ config.json - Configuration settings
  • ✅ build_log.json - Build information
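
Before uploading, it's worth confirming that all five artifacts exist locally; a small stdlib guard (the file names come from the list above):

```python
from pathlib import Path

EXPECTED = ["faiss_index.bin", "documents.json", "metadata.json",
            "config.json", "build_log.json"]

def missing_artifacts(store_dir: str) -> list[str]:
    """Return the expected files that are absent from the store directory."""
    root = Path(store_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]
```

If the returned list is non-empty, rerun the build before uploading rather than pushing a partial store.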

Advanced Usage

Batch Processing Multiple Files

# Create a script to add multiple files
for pdf in new_guidelines/*.pdf; do
  python scripts/add_document.py \
    --file "$pdf" \
    --citation "$(basename "$pdf" .pdf)" \
    --vector-store-dir ./data/vector_store
done

# Then upload once
python scripts/add_document.py \
  --file dummy.pdf \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store \
  --no-duplicate-check

Inspecting Vector Store

# View config
cat data/vector_store/config.json

# View build log
cat data/vector_store/build_log.json | python -m json.tool

# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"

# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
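
The same metadata file supports quick per-source statistics, assuming each entry carries a 'source' key as the one-liner above does:

```python
import json
from collections import Counter

def chunks_per_source(metadata_path: str) -> Counter:
    # Count how many chunks each source PDF contributed to the store.
    with open(metadata_path) as f:
        meta = json.load(f)
    return Counter(m["source"] for m in meta)
```

Calling `chunks_per_source("data/vector_store/metadata.json").most_common()` gives a ranked view of which guidelines dominate the index — useful for spotting a PDF that extracted poorly (suspiciously few chunks).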

Backup Management

Backups are created automatically in data/vector_store/backups/:

# List backups
ls -lh data/vector_store/backups/

# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/

Quality Checks

Check extraction quality for a specific PDF:

from scripts.build_vector_store import PDFExtractor

text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")

Troubleshooting

Issue: "No PDF files found"

Solution:

# Check directory exists
ls -la ./Obs

# Use absolute path
python scripts/build_vector_store.py \
  --input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
  --output-dir ./data/vector_store

Issue: "Extracted text too short"

Causes:

  • Scanned PDF (image-based)
  • Encrypted PDF
  • Corrupted PDF

Solution:

# Check PDF manually
open Obs/problematic.pdf

# Try with OCR (requires tesseract)
pip install pytesseract
# Script will auto-fallback to OCR
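
The extractor's fallback chain (PyMuPDF, then pdfplumber, then OCR) amounts to trying extractors in order until one returns enough text. A generic sketch with stub extractors; the 200-character threshold is an assumption, not the script's actual value:

```python
MIN_CHARS = 200  # assumed "text too short" threshold

def extract_with_fallback(path, extractors):
    """Try each (name, fn) pair in order; accept the first long-enough result."""
    best_text, best_name = "", "none"
    for name, fn in extractors:
        try:
            text = fn(path)
        except Exception:
            continue  # e.g. encrypted or corrupted PDF
        if len(text) >= MIN_CHARS:
            return text, name
        if len(text) > len(best_text):
            best_text, best_name = text, name
    return best_text, best_name  # best effort if every extractor fell short
```

OCR lands last in the chain because it is by far the slowest option and only needed for image-based (scanned) PDFs.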

Issue: "Embedding dimension mismatch"

Solution:

# Check existing config
cat data/vector_store/config.json

# Rebuild with same model
python scripts/build_vector_store.py \
  --embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
  --input-dir ./Obs \
  --output-dir ./data/vector_store
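
A cheap guard against this class of error is to compare the model name recorded in config.json before appending new vectors; a sketch (the "embedding_model" key is an assumption about what the build script records — check your config.json for the actual key):

```python
import json

def check_model_matches(config_path: str, model_name: str) -> None:
    # Refuse to mix embeddings from different models in one FAISS index.
    with open(config_path) as f:
        cfg = json.load(f)
    recorded = cfg.get("embedding_model")
    if recorded and recorded != model_name:
        raise ValueError(
            f"Vector store was built with {recorded!r}; "
            f"rebuild it or switch back from {model_name!r}."
        )
```

Mixing, say, 384-dimension MiniLM vectors with 768-dimension mpnet vectors either fails at index time or silently corrupts retrieval, so failing fast here is the safer behavior.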

Issue: "Upload failed"

Solution:

# Check HF token
echo $HF_TOKEN

# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"

# Create repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"

Issue: "Out of memory"

Solution:

# Reduce batch size in script (edit build_vector_store.py)
# Line ~338: change batch_size=32 to batch_size=8

# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into batches
python scripts/build_vector_store.py --input-dir Obs/batch1 ...
for pdf in Obs/batch2/*.pdf; do python scripts/add_document.py --file "$pdf" ...; done
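
Lowering batch_size only changes how many chunks the encoder sees at once; the resulting vectors are identical, but peak memory is smaller. The pattern, with a stub standing in for the SentenceTransformer encoder:

```python
def embed_in_batches(chunks, encode, batch_size=8):
    # Encode in slices so only `batch_size` chunks are in flight at a time.
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(encode(chunks[i:i + batch_size]))
    return vectors

fake_encode = lambda batch: [[float(len(c))] for c in batch]  # stub encoder
print(embed_in_batches(["aa", "bbb", "c"], fake_encode, batch_size=2))
# → [[2.0], [3.0], [1.0]]
```

Note that sentence-transformers' own `encode` already accepts a `batch_size` argument, which is why the suggested fix is a one-line edit in build_vector_store.py.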

Issue: "Duplicate detected but I want to update"

Solution:

# Option 1: Force add (creates duplicate)
python scripts/add_document.py \
  --file ./updated.pdf \
  --no-duplicate-check \
  --vector-store-dir ./data/vector_store

# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

Best Practices

1. Organize Your PDFs

Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...

2. Use Meaningful Citations

# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"

# Bad
--citation "guideline.pdf"

3. Regular Backups

# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)

4. Test Before Uploading

# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vs

# Test with RAG system
# Then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload

5. Version Control

Add to .gitignore:

data/vector_store/
test_vector_store/
*.log
backups/

Keep in Git:

scripts/
Obs/
requirements.txt

Integration with VedaMD

Using Your Vector Store

After building, update your RAG system:

# In enhanced_groq_medical_rag.py or wherever vector store is loaded

# Option 1: Load from local directory
vector_store = SimpleVectorStore("./data/vector_store")

# Option 2: Load from HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")

Automatic Reloading

For production, reload vector store periodically:

import schedule
import time

def reload_vector_store():
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")

# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)

while True:
    schedule.run_pending()
    time.sleep(60)

Next Steps

  1. Build your initial vector store:

    python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
    
  2. Upload to HF:

    python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
    
  3. Test with RAG system:

    python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
    
  4. Add new documents as they arrive:

    python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
    

Questions or Issues?

Check the logs:

  • vector_store_build.log - Build process
  • add_document.log - Document additions

Or review the scripts: scripts/build_vector_store.py and scripts/add_document.py

Last Updated: October 23, 2025
Version: 1.0.0