
VedaMD Document Pipeline Guide

Complete guide for adding and managing medical documents in VedaMD


Table of Contents

  1. Overview
  2. Quick Start
  3. Building Vector Store from Scratch
  4. Adding Single Documents
  5. Updating Existing Documents
  6. Uploading to Hugging Face
  7. Advanced Usage
  8. Troubleshooting

Overview

What is the Pipeline?

The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.

Before Pipeline (Manual Process):

PDF → Extract Text → Chunk → Embed → Build FAISS → Upload to HF
  ↓        ↓          ↓        ↓         ↓            ↓
Hours    Manual     Script   Script   External    Manual
         Work       Needed   Needed    Tool       Upload

With Pipeline (Automated):

PDF → python add_document.py file.pdf → Done ✅
  ↓
Minutes

Pipeline Components

  1. build_vector_store.py - Build complete vector store from directory of PDFs
  2. add_document.py - Add single documents to existing vector store
  3. Automatic Features:
    • PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
    • Smart medical chunking
    • Duplicate detection
    • Quality validation
    • HF Hub integration
    • Automatic backups
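
Conceptually, these components boil down to extract → chunk → embed → index, repeated per PDF. A stdlib-only sketch of that flow with stub extract/embed steps (function names are illustrative; the real implementations in scripts/build_vector_store.py use PyMuPDF, sentence-transformers, and FAISS):

```python
def extract(pdf_path: str) -> str:
    return f"(text of {pdf_path})" * 50          # stub for PDF extraction

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # Sliding window with overlap, mirroring the default chunking parameters.
    out, start = [], 0
    while start < len(text):
        out.append(text[start:start + size])
        start += size - overlap
    return out

def embed(chunks: list[str]) -> list[list[float]]:
    return [[float(len(c))] for c in chunks]      # stub for the embedding model

def build(pdf_paths: list[str]) -> list[list[float]]:
    # One pass over all PDFs, accumulating vectors for the index.
    vectors = []
    for path in pdf_paths:
        vectors.extend(embed(chunk(extract(path))))
    return vectors

print(len(build(["Obs/Breech.pdf", "Obs/RhESUS.pdf"])))
```

The real scripts add the surrounding machinery (duplicate checks, backups, HF upload) on top of this loop.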

Quick Start

Prerequisites

All required packages are already installed in your .venv:

  • ✅ PyMuPDF (PDF extraction)
  • ✅ pdfplumber (backup PDF extraction)
  • ✅ sentence-transformers (embeddings)
  • ✅ faiss-cpu (vector indexing)
  • ✅ huggingface-hub (uploading)

30-Second Test

# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate

# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

# That's it! ✅

Building Vector Store from Scratch

Basic Usage

Build a vector store from all PDFs in a directory:

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

Expected output:

🚀 STARTING VECTOR STORE BUILD

🔍 Scanning for PDFs in Obs
✅ Found 15 PDF files
  📄 Breech.pdf
  📄 RhESUS.pdf
  ... (13 more)

============================================================
📄 Processing: Breech.pdf

📄 Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
📝 Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
🧮 Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added

... (processes all PDFs)

============================================================
✅ BUILD COMPLETE!

📊 Summary:
  • PDFs processed: 15
  • Total chunks: 247
  • Embedding dimension: 384
  • Output directory: ./data/vector_store
  • Build time: 45.23 seconds

Customizing Chunk Size

For longer/shorter chunks:

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --chunk-size 1500 \
  --chunk-overlap 150

Recommendations:

  • chunk-size: 800-1200 (default: 1000)
  • chunk-overlap: 50-200 (default: 100)
  • Smaller chunks = more precise retrieval
  • Larger chunks = better context

Using Different Embedding Model

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --embedding-model "sentence-transformers/all-mpnet-base-v2"

Available models:

  • all-MiniLM-L6-v2 (default) - Fast, 384d, good quality
  • all-mpnet-base-v2 - Better quality, 768d, slower
  • multi-qa-mpnet-base-dot-v1 - Optimized for Q&A

Build and Upload to HF

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

Note: Requires HF_TOKEN environment variable or --hf-token argument


Adding Single Documents

Basic Usage

Add a new guideline to existing vector store:

python scripts/add_document.py \
  --file ./new_guideline.pdf \
  --citation "SLCOG Hypertension Guidelines 2025" \
  --category "Obstetrics" \
  --vector-store-dir ./data/vector_store

Expected output:

📄 Adding document: new_guideline.pdf

📄 Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
🔑 File hash: a3f2c9d8e1b0...
🔍 Checking for duplicates...
✅ No duplicates found
📝 Created 14 chunks
🧮 Generating embeddings...
📊 Adding to FAISS index...
✅ Added 14 chunks to vector store
📊 New total: 261 vectors

============================================================
💾 Saving updated vector store...

📦 Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config

============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!

📊 Summary:
  • Chunks added: 14
  • Total vectors: 261
  • Time taken: 8.43 seconds

Add and Upload to HF

python scripts/add_document.py \
  --file ./new_guideline.pdf \
  --citation "WHO Guidelines 2025" \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

Allow Duplicates

By default, duplicate detection is enabled. To force add:

python scripts/add_document.py \
  --file ./updated_guideline.pdf \
  --vector-store-dir ./data/vector_store \
  --no-duplicate-check
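
The duplicate check appears to key on a content hash of the file (note the "File hash" line in the output above). A stdlib sketch of that idea, assuming the hashes of previously added files are kept in a set:

```python
import hashlib

def file_hash(path: str) -> str:
    # Hash the raw bytes in blocks so large PDFs never need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def is_duplicate(path: str, known_hashes: set[str]) -> bool:
    return file_hash(path) in known_hashes
```

Because the hash covers file content rather than the file name, renaming a PDF will not evade the check, but any edit to the file produces a new hash.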

Updating Existing Documents

To update an existing guideline:

  1. Add new version (recommended):
python scripts/add_document.py \
  --file ./guidelines_v2.pdf \
  --citation "SLCOG Hypertension Guidelines 2025 v2" \
  --vector-store-dir ./data/vector_store
  2. Rebuild from scratch (if major changes):
# Move old PDFs to archive
mkdir -p Obs/archive
mv Obs/old_guideline.pdf Obs/archive/

# Add new version
cp ~/Downloads/new_guideline.pdf Obs/

# Rebuild
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

Uploading to Hugging Face

Setup HF Token

# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"

# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...

Initial Upload

python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

Incremental Upload

After adding a document:

python scripts/add_document.py \
  --file ./new.pdf \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store

What Gets Uploaded

  • ✅ faiss_index.bin - FAISS vector index
  • ✅ documents.json - Document chunks
  • ✅ metadata.json - Citations, sources, sections
  • ✅ config.json - Configuration settings
  • ✅ build_log.json - Build information
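
Before uploading, it's worth confirming that all five artifacts exist locally; a small stdlib guard (the file names come from the list above):

```python
from pathlib import Path

EXPECTED = ["faiss_index.bin", "documents.json", "metadata.json",
            "config.json", "build_log.json"]

def missing_artifacts(store_dir: str) -> list[str]:
    """Return the expected files that are absent from the store directory."""
    root = Path(store_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]
```

If the returned list is non-empty, rerun the build before uploading rather than pushing a partial store.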

Advanced Usage

Batch Processing Multiple Files

# Create a script to add multiple files
for pdf in new_guidelines/*.pdf; do
  python scripts/add_document.py \
    --file "$pdf" \
    --citation "$(basename "$pdf" .pdf)" \
    --vector-store-dir ./data/vector_store
done

# Then upload once
python scripts/add_document.py \
  --file dummy.pdf \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store \
  --no-duplicate-check

Inspecting Vector Store

# View config
cat data/vector_store/config.json

# View build log
cat data/vector_store/build_log.json | python -m json.tool

# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"

# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
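
The same metadata file supports quick per-source statistics, assuming each entry carries a 'source' key as the one-liner above does:

```python
import json
from collections import Counter

def chunks_per_source(metadata_path: str) -> Counter:
    # Count how many chunks each source PDF contributed to the store.
    with open(metadata_path) as f:
        meta = json.load(f)
    return Counter(m["source"] for m in meta)
```

Calling `chunks_per_source("data/vector_store/metadata.json").most_common()` gives a ranked view of which guidelines dominate the index — useful for spotting a PDF that extracted poorly (suspiciously few chunks).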

Backup Management

Backups are created automatically in data/vector_store/backups/:

# List backups
ls -lh data/vector_store/backups/

# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/

Quality Checks

Check extraction quality for a specific PDF:

from scripts.build_vector_store import PDFExtractor

text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")

Troubleshooting

Issue: "No PDF files found"

Solution:

# Check directory exists
ls -la ./Obs

# Use absolute path
python scripts/build_vector_store.py \
  --input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
  --output-dir ./data/vector_store

Issue: "Extracted text too short"

Causes:

  • Scanned PDF (image-based)
  • Encrypted PDF
  • Corrupted PDF

Solution:

# Check PDF manually
open Obs/problematic.pdf

# Try with OCR (requires tesseract)
pip install pytesseract
# Script will auto-fallback to OCR
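
The extractor's fallback chain (PyMuPDF, then pdfplumber, then OCR) amounts to trying extractors in order until one returns enough text. A generic sketch with stub extractors; the 200-character threshold is an assumption, not the script's actual value:

```python
MIN_CHARS = 200  # assumed "text too short" threshold

def extract_with_fallback(path, extractors):
    """Try each (name, fn) pair in order; accept the first long-enough result."""
    best_text, best_name = "", "none"
    for name, fn in extractors:
        try:
            text = fn(path)
        except Exception:
            continue  # e.g. encrypted or corrupted PDF
        if len(text) >= MIN_CHARS:
            return text, name
        if len(text) > len(best_text):
            best_text, best_name = text, name
    return best_text, best_name  # best effort if every extractor fell short
```

OCR lands last in the chain because it is by far the slowest option and only needed for image-based (scanned) PDFs.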

Issue: "Embedding dimension mismatch"

Solution:

# Check existing config
cat data/vector_store/config.json

# Rebuild with same model
python scripts/build_vector_store.py \
  --embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
  --input-dir ./Obs \
  --output-dir ./data/vector_store
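
A cheap guard against this class of error is to compare the model name recorded in config.json before appending new vectors; a sketch (the "embedding_model" key is an assumption about what the build script records — check your config.json for the actual key):

```python
import json

def check_model_matches(config_path: str, model_name: str) -> None:
    # Refuse to mix embeddings from different models in one FAISS index.
    with open(config_path) as f:
        cfg = json.load(f)
    recorded = cfg.get("embedding_model")
    if recorded and recorded != model_name:
        raise ValueError(
            f"Vector store was built with {recorded!r}; "
            f"rebuild it or switch back from {model_name!r}."
        )
```

Mixing, say, 384-dimension MiniLM vectors with 768-dimension mpnet vectors either fails at index time or silently corrupts retrieval, so failing fast here is the safer behavior.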

Issue: "Upload failed"

Solution:

# Check HF token
echo $HF_TOKEN

# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"

# Create repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"

Issue: "Out of memory"

Solution:

# Reduce batch size in script (edit build_vector_store.py)
# Line ~338: change batch_size=32 to batch_size=8

# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into batches
python scripts/build_vector_store.py --input-dir Obs/batch1 ...
for pdf in Obs/batch2/*.pdf; do python scripts/add_document.py --file "$pdf" ...; done
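
Lowering batch_size only changes how many chunks the encoder sees at once; the resulting vectors are identical, but peak memory is smaller. The pattern, with a stub standing in for the SentenceTransformer encoder:

```python
def embed_in_batches(chunks, encode, batch_size=8):
    # Encode in slices so only `batch_size` chunks are in flight at a time.
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(encode(chunks[i:i + batch_size]))
    return vectors

fake_encode = lambda batch: [[float(len(c))] for c in batch]  # stub encoder
print(embed_in_batches(["aa", "bbb", "c"], fake_encode, batch_size=2))
# → [[2.0], [3.0], [1.0]]
```

Note that sentence-transformers' own `encode` already accepts a `batch_size` argument, which is why the suggested fix is a one-line edit in build_vector_store.py.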

Issue: "Duplicate detected but I want to update"

Solution:

# Option 1: Force add (creates duplicate)
python scripts/add_document.py \
  --file ./updated.pdf \
  --no-duplicate-check \
  --vector-store-dir ./data/vector_store

# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

Best Practices

1. Organize Your PDFs

Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...

2. Use Meaningful Citations

# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"

# Bad
--citation "guideline.pdf"

3. Regular Backups

# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)

4. Test Before Uploading

# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vs

# Test with RAG system
# Then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload

5. Version Control

Add to .gitignore:

data/vector_store/
test_vector_store/
*.log
backups/

Keep in Git:

scripts/
Obs/
requirements.txt

Integration with VedaMD

Using Your Vector Store

After building, update your RAG system:

# In enhanced_groq_medical_rag.py or wherever vector store is loaded

# Option 1: Load from local directory
vector_store = SimpleVectorStore("./data/vector_store")

# Option 2: Load from HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")

Automatic Reloading

For production, reload vector store periodically:

import schedule
import time

def reload_vector_store():
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")

# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)

while True:
    schedule.run_pending()
    time.sleep(60)

Next Steps

  1. Build your initial vector store:

    python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
    
  2. Upload to HF:

    python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
    
  3. Test with RAG system:

    python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
    
  4. Add new documents as they arrive:

    python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
    

Questions or Issues?

Check the logs:

  • vector_store_build.log - Build process
  • add_document.log - Document additions

Or review the scripts: scripts/build_vector_store.py and scripts/add_document.py

Last Updated: October 23, 2025
Version: 1.0.0