VedaMD Document Pipeline Guide
Complete guide for adding and managing medical documents in VedaMD
Table of Contents
- Overview
- Quick Start
- Building Vector Store from Scratch
- Adding Single Documents
- Updating Existing Documents
- Uploading to Hugging Face
- Advanced Usage
- Troubleshooting
Overview
What is the Pipeline?
The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.
Before Pipeline (Manual Process):
PDF → Extract Text → Chunk → Embed → Build FAISS → Upload to HF
 ↓         ↓           ↓       ↓          ↓             ↓
Hours    Manual      Script  Script   External       Manual
         Work        Needed  Needed   Tool           Upload
With Pipeline (Automated):
PDF → python add_document.py file.pdf → Done ✅
                    ↓
                 Minutes
Pipeline Components
- build_vector_store.py - Build complete vector store from directory of PDFs
- add_document.py - Add single documents to existing vector store
- Automatic Features:
- PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
- Smart medical chunking
- Duplicate detection
- Quality validation
- HF Hub integration
- Automatic backups
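The extraction fallback listed above (PyMuPDF, then pdfplumber, then OCR) can be pictured as trying extractors in order until one returns enough text. The following is a minimal sketch of that pattern, not the code in build_vector_store.py: the name `extract_with_fallback`, the `min_chars` threshold, and the stand-in extractors are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: an extractor takes a PDF path and returns its text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    pdf_path: str,
    extractors: Iterable[Tuple[str, Extractor]],
    min_chars: int = 100,  # assumed threshold, not taken from the real script
) -> Tuple[Optional[str], str]:
    """Try each extractor in order and return (method_name, text) for the
    first one whose stripped output has at least min_chars characters."""
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            # A failing backend should not abort the chain; try the next one.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""


# Stand-in extractors so the sketch runs without any PDF library installed.
def fake_pymupdf(path: str) -> str:
    return ""  # simulates a scanned PDF with no embedded text layer


def fake_pdfplumber(path: str) -> str:
    raise RuntimeError("simulated parser failure")


def fake_ocr(path: str) -> str:
    return "Breech presentation management guideline. " * 5


method, text = extract_with_fallback(
    "Obs/Breech.pdf",
    [("pymupdf", fake_pymupdf), ("pdfplumber", fake_pdfplumber), ("ocr", fake_ocr)],
)
print(method, len(text))
```

Passing the extractors as a list keeps the order explicit and makes it easy to test the chain without real PDFs.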
Quick Start
Prerequisites
All required packages are already installed in your .venv:
- ✅ PyMuPDF (PDF extraction)
- ✅ pdfplumber (backup PDF extraction)
- ✅ sentence-transformers (embeddings)
- ✅ faiss-cpu (vector indexing)
- ✅ huggingface-hub (uploading)
30-Second Test
# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate
# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
# That's it! ✅
Building Vector Store from Scratch
Basic Usage
Build a vector store from all PDFs in a directory:
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
Expected output:
STARTING VECTOR STORE BUILD
Scanning for PDFs in Obs
✅ Found 15 PDF files
Breech.pdf
RhESUS.pdf
... (13 more)
============================================================
Processing: Breech.pdf
Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added
... (processes all PDFs)
============================================================
✅ BUILD COMPLETE!
Summary:
• PDFs processed: 15
• Total chunks: 247
• Embedding dimension: 384
• Output directory: ./data/vector_store
• Build time: 45.23 seconds
Customizing Chunk Size
For longer/shorter chunks:
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--chunk-size 1500 \
--chunk-overlap 150
Recommendations:
- chunk-size: 800-1200 (default: 1000)
- chunk-overlap: 50-200 (default: 100)
- Smaller chunks = more precise retrieval
- Larger chunks = better context
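To make the interaction between the two settings concrete, here is a minimal sliding-window sketch. It assumes chunk-size and chunk-overlap are measured in characters, and it ignores the sentence and section boundaries that the pipeline's medical chunking presumably respects. The function `chunk_text` is illustrative, not the pipeline's chunker.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 2500 characters of filler text, chunked with the guide's default settings.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])
```

Each window advances by chunk-size minus chunk-overlap, so a larger overlap produces more chunks for the same document and a correspondingly larger index.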
Using Different Embedding Model
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--embedding-model "sentence-transformers/all-mpnet-base-v2"
Available models:
- all-MiniLM-L6-v2 (default) - Fast, 384d, good quality
- all-mpnet-base-v2 - Better quality, 768d, slower
- multi-qa-mpnet-base-dot-v1 - Optimized for Q&A
Build and Upload to HF
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
Note: Requires HF_TOKEN environment variable or --hf-token argument
Adding Single Documents
Basic Usage
Add a new guideline to existing vector store:
python scripts/add_document.py \
--file ./new_guideline.pdf \
--citation "SLCOG Hypertension Guidelines 2025" \
--category "Obstetrics" \
--vector-store-dir ./data/vector_store
Expected output:
Adding document: new_guideline.pdf
Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
File hash: a3f2c9d8e1b0...
Checking for duplicates...
✅ No duplicates found
Created 14 chunks
Generating embeddings...
Adding to FAISS index...
✅ Added 14 chunks to vector store
New total: 261 vectors
============================================================
Saving updated vector store...
Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config
============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!
Summary:
• Chunks added: 14
• Total vectors: 261
• Time taken: 8.43 seconds
Add and Upload to HF
python scripts/add_document.py \
--file ./new_guideline.pdf \
--citation "WHO Guidelines 2025" \
--vector-store-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
Allow Duplicates
By default, duplicate detection is enabled. To force add:
python scripts/add_document.py \
--file ./updated_guideline.pdf \
--vector-store-dir ./data/vector_store \
--no-duplicate-check
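The sample output shows a file hash being computed before the duplicate check. One plausible way such a check could work is sketched below. It assumes a SHA-256 digest of the raw file bytes compared against the hashes already recorded in the store. The guide does not say which hash add_document.py actually uses, so `file_sha256` and `is_duplicate` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large PDFs are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_duplicate(path: Path, known_hashes: set) -> bool:
    """True if a byte-identical file has already been added."""
    return file_sha256(path) in known_hashes


# Demo with throwaway files standing in for real guideline PDFs.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "guideline.pdf"
    copy = Path(tmp) / "guideline_copy.pdf"
    revised = Path(tmp) / "guideline_v2.pdf"
    first.write_bytes(b"%PDF-1.4 original content")
    copy.write_bytes(b"%PDF-1.4 original content")
    revised.write_bytes(b"%PDF-1.4 revised content")

    known = {file_sha256(first)}
    copy_is_dup = is_duplicate(copy, known)
    revised_is_dup = is_duplicate(revised, known)

print(copy_is_dup, revised_is_dup)
```

Under this assumption a renamed copy is caught, while a revised PDF with even a one-byte change counts as a new document, so the chunks from the older version stay in the index until the store is rebuilt.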
Updating Existing Documents
To update an existing guideline:
- Add new version (recommended):
python scripts/add_document.py \
--file ./guidelines_v2.pdf \
--citation "SLCOG Hypertension Guidelines 2025 v2" \
--vector-store-dir ./data/vector_store
- Rebuild from scratch (if major changes):
# Move old PDFs to archive
mkdir -p Obs/archive
mv Obs/old_guideline.pdf Obs/archive/
# Add new version
cp ~/Downloads/new_guideline.pdf Obs/
# Rebuild
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
Uploading to Hugging Face
Setup HF Token
# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"
# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...
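For readers wiring the same two options into their own tooling, the lookup can be sketched as below. The precedence (an explicit argument wins over the environment variable) is an assumption, and `resolve_hf_token` is a hypothetical helper rather than a function from the pipeline scripts.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    cli_token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the token from the CLI argument if given, else from HF_TOKEN."""
    env = os.environ if env is None else env
    token = cli_token or env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("No Hugging Face token: set HF_TOKEN or pass --hf-token")
    return token


# The env mapping is injectable so the lookup can be exercised without
# touching the real environment.
print(resolve_hf_token(None, {"HF_TOKEN": "hf_from_env"}))
```

The environment variable is the safer of the two options because a token passed on the command line ends up in shell history.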
Initial Upload
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
Incremental Upload
After adding a document:
python scripts/add_document.py \
--file ./new.pdf \
--vector-store-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
What Gets Uploaded
- ✅ faiss_index.bin - FAISS vector index
- ✅ documents.json - Document chunks
- ✅ metadata.json - Citations, sources, sections
- ✅ config.json - Configuration settings
- ✅ build_log.json - Build information
Advanced Usage
Batch Processing Multiple Files
# Create a script to add multiple files
for pdf in new_guidelines/*.pdf; do
python scripts/add_document.py \
--file "$pdf" \
--citation "$(basename "$pdf" .pdf)" \
--vector-store-dir ./data/vector_store
done
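# Then upload once. add_document.py still needs a --file argument,
# so the PDF passed here (dummy.pdf) is itself added to the store before the upload.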
python scripts/add_document.py \
--file dummy.pdf \
--vector-store-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store \
--no-duplicate-check
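An alternative to the final add-and-upload step is to push the finished directory directly with `huggingface_hub`. The sketch below assumes the store directory contains the five files listed under "What Gets Uploaded" and that the target repo is a dataset repo, as in the `create_repo` call under Troubleshooting. `missing_store_files` and `upload_store` are hypothetical helpers, not part of the pipeline scripts.

```python
import tempfile
from pathlib import Path
from typing import List

# File names taken from the "What Gets Uploaded" list in this guide.
EXPECTED_FILES = [
    "faiss_index.bin",
    "documents.json",
    "metadata.json",
    "config.json",
    "build_log.json",
]


def missing_store_files(store_dir: str) -> List[str]:
    """Return the expected files that are absent from store_dir."""
    root = Path(store_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def upload_store(store_dir: str, repo_id: str, token: str) -> None:
    """Upload the vector store files in a single commit."""
    missing = missing_store_files(store_dir)
    if missing:
        raise FileNotFoundError(f"Vector store is incomplete, missing: {missing}")
    # Imported lazily so the completeness check works without huggingface_hub.
    from huggingface_hub import HfApi

    HfApi(token=token).upload_folder(
        folder_path=store_dir,
        repo_id=repo_id,
        repo_type="dataset",  # assumption: same repo type as in Troubleshooting
        allow_patterns=EXPECTED_FILES,  # leaves backups/ and log files out
        commit_message="Update vector store after batch add",
    )


# Offline demo of the completeness check only; no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_store_files(tmp)
print(demo_missing)

# Real usage would look like:
#   upload_store("./data/vector_store", "sniro23/VedaMD-Vector-Store", token)
```

Checking for missing files first avoids publishing a half-written store if one of the batch additions failed.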
Inspecting Vector Store
# View config
cat data/vector_store/config.json
# View build log
cat data/vector_store/build_log.json | python -m json.tool
# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"
# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
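When the one-liners become unwieldy, the same inspection can be written as a short script. This sketch assumes, as the one-liner above does, that metadata.json holds a list of dictionaries with a `source` key. The sample entries are made up for the demo, and `chunks_per_source` is an illustrative helper.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List


def load_metadata(store_dir: str) -> List[dict]:
    """Read metadata.json from a vector store directory."""
    with open(Path(store_dir) / "metadata.json", encoding="utf-8") as handle:
        return json.load(handle)


def chunks_per_source(metadata: List[dict]) -> Dict[str, int]:
    """Count chunks per source file, most frequent first."""
    counts = Counter(entry.get("source", "<unknown>") for entry in metadata)
    return dict(counts.most_common())


# Made-up sample standing in for load_metadata("data/vector_store").
sample_metadata = [
    {"source": "Breech.pdf"},
    {"source": "Breech.pdf"},
    {"source": "RhESUS.pdf"},
]
summary = chunks_per_source(sample_metadata)
for source, count in summary.items():
    print(f"{count:4d}  {source}")
```

A source with an unexpectedly low chunk count is a quick hint that its PDF extracted poorly.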
Backup Management
Backups are created automatically in data/vector_store/backups/:
# List backups
ls -lh data/vector_store/backups/
# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/
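Restoring the most recent backup can also be scripted. The sketch below relies on the backup directory names following the YYYYMMDD_HHMMSS pattern shown above, so that sorting by name is also sorting by time. `latest_backup` and `restore_backup` are hypothetical helpers, and the demo runs against a temporary directory rather than a real store.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Optional


def latest_backup(backups_dir: Path) -> Optional[Path]:
    """Return the newest backup directory, or None if there are none.
    Timestamped names sort chronologically when sorted as strings."""
    candidates = sorted(p for p in backups_dir.iterdir() if p.is_dir())
    return candidates[-1] if candidates else None


def restore_backup(backup_dir: Path, store_dir: Path) -> List[str]:
    """Copy every file from backup_dir over the live store; return their names."""
    restored = []
    for item in sorted(backup_dir.iterdir()):
        if item.is_file():
            shutil.copy2(item, store_dir / item.name)
            restored.append(item.name)
    return restored


# Demo: two fake backups, each holding a config.json tagged with its timestamp.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "vector_store"
    for stamp in ("20251022_090000", "20251023_150000"):
        backup = store / "backups" / stamp
        backup.mkdir(parents=True)
        (backup / "config.json").write_text(stamp)
    (store / "config.json").write_text("current")

    newest = latest_backup(store / "backups")
    newest_name = newest.name
    restored_files = restore_backup(newest, store)
    restored_content = (store / "config.json").read_text()

print(newest_name, restored_files)
```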
Quality Checks
Check extraction quality for a specific PDF:
from scripts.build_vector_store import PDFExtractor
text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")
Troubleshooting
Issue: "No PDF files found"
Solution:
# Check directory exists
ls -la ./Obs
# Use absolute path
python scripts/build_vector_store.py \
--input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
--output-dir ./data/vector_store
Issue: "Extracted text too short"
Causes:
- Scanned PDF (image-based)
- Encrypted PDF
- Corrupted PDF
Solution:
# Check PDF manually
open Obs/problematic.pdf
# Try with OCR (requires tesseract)
pip install pytesseract
# Script will auto-fallback to OCR
Issue: "Embedding dimension mismatch"
Solution:
# Check existing config
cat data/vector_store/config.json
# Rebuild with same model
python scripts/build_vector_store.py \
--embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
--input-dir ./Obs \
--output-dir ./data/vector_store
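A mismatch happens when the model used to embed new text produces vectors of a different size than the ones already in the index. A small guard such as the sketch below can catch this before anything is added. The 384 and 768 figures for the first two models come from this guide. The 768 for multi-qa-mpnet-base-dot-v1 is from that model's published card, not from this guide. `check_model_matches_index` is a hypothetical helper.

```python
# Output dimensions of the three models listed under "Available models".
MODEL_DIMENSIONS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "sentence-transformers/multi-qa-mpnet-base-dot-v1": 768,
}


def check_model_matches_index(model_name: str, index_dimension: int) -> None:
    """Raise ValueError if model_name would produce vectors of the wrong size."""
    if model_name not in MODEL_DIMENSIONS:
        raise KeyError(f"Unknown model {model_name!r}; add its dimension to MODEL_DIMENSIONS")
    model_dimension = MODEL_DIMENSIONS[model_name]
    if model_dimension != index_dimension:
        raise ValueError(
            f"{model_name} produces {model_dimension}-d vectors, "
            f"but the existing index stores {index_dimension}-d vectors"
        )


# With faiss installed, the index dimension could be read as:
#   index_dimension = faiss.read_index("data/vector_store/faiss_index.bin").d
check_model_matches_index("sentence-transformers/all-MiniLM-L6-v2", 384)  # passes silently

try:
    check_model_matches_index("sentence-transformers/all-mpnet-base-v2", 384)
except ValueError as error:
    mismatch_message = str(error)
    print(mismatch_message)
```

Switching to a model with a different dimension always requires rebuilding the whole store, since vectors of different sizes cannot share one FAISS index.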
Issue: "Upload failed"
Solution:
# Check HF token
echo $HF_TOKEN
# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"
# Create repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"
Issue: "Out of memory"
Solution:
# Reduce batch size in script (edit build_vector_store.py)
# Line ~338: change batch_size=32 to batch_size=8
# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into batches
python scripts/build_vector_store.py --input-dir Obs/batch1 ...
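# add_document.py takes one file at a time, so loop over the second batch
for pdf in Obs/batch2/*.pdf; do
python scripts/add_document.py --file "$pdf" ...
done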
Issue: "Duplicate detected but I want to update"
Solution:
# Option 1: Force add (creates duplicate)
python scripts/add_document.py \
--file ./updated.pdf \
--no-duplicate-check \
--vector-store-dir ./data/vector_store
# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
Best Practices
1. Organize Your PDFs
Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...
2. Use Meaningful Citations
# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"
# Bad
--citation "guideline.pdf"
3. Regular Backups
# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)
4. Test Before Uploading
# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vector_store
# Test with RAG system
# Then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload
5. Version Control
Add to .gitignore:
data/vector_store/
test_vector_store/
*.log
backups/
Keep in Git:
scripts/
Obs/
requirements.txt
Integration with VedaMD
Using Your Vector Store
After building, update your RAG system:
# In enhanced_groq_medical_rag.py or wherever vector store is loaded
# Option 1: Load from local directory
vector_store = SimpleVectorStore("./data/vector_store")
# Option 2: Load from HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
Automatic Reloading
For production, reload vector store periodically:
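import logging
import time
import schedule  # third-party package: pip install schedule
# SimpleVectorStore is the same class used in the loading example above
logger = logging.getLogger(__name__)
def reload_vector_store():
    """Replace the module-level vector store with a fresh copy from HF Hub."""
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")
# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)
# Check once a minute whether a reload is due (this loop blocks the calling thread)
while True:
    schedule.run_pending()
    time.sleep(60)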
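The `while True` loop above blocks whichever thread runs it, which is a problem if that thread also serves requests. One alternative, sketched here with only the standard library rather than the `schedule` package, runs the reload on a daemon thread. `start_periodic` is a hypothetical helper, and the lambda in the demo stands in for `reload_vector_store`.

```python
import logging
import threading
import time
from typing import Callable

logger = logging.getLogger(__name__)


def start_periodic(
    task: Callable[[], None],
    interval_seconds: float,
    stop_event: threading.Event,
) -> threading.Thread:
    """Run task every interval_seconds on a daemon thread until stop_event is set."""

    def loop() -> None:
        # Event.wait returns False on timeout (time to run the task)
        # and True once stop_event is set (time to exit).
        while not stop_event.wait(interval_seconds):
            try:
                task()
            except Exception:
                # Keep the previous state and try again at the next interval.
                logger.exception("Periodic task failed; will retry next interval")

    thread = threading.Thread(target=loop, name="vector-store-reload", daemon=True)
    thread.start()
    return thread


# Demo with a very short interval; in production the task would be
# reload_vector_store and the interval 6 * 60 * 60 seconds.
calls = []
stop = threading.Event()
worker = start_periodic(lambda: calls.append(time.monotonic()), 0.01, stop)
time.sleep(0.2)
stop.set()
worker.join(timeout=1)
print(f"task ran {len(calls)} times")
```

Catching exceptions inside the loop means a failed download leaves the previously loaded store in place instead of ending the reload thread.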
Next Steps
- Build your initial vector store:
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
- Upload to HF:
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
- Test with RAG system:
python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
- Add new documents as they arrive:
python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
Questions or Issues?
Check the logs:
- vector_store_build.log - Build process
- add_document.log - Document additions
Or review the scripts:
- scripts/build_vector_store.py
- scripts/add_document.py
Last Updated: October 23, 2025
Version: 1.0.0