# VedaMD Document Pipeline Guide

**Complete guide for adding and managing medical documents in VedaMD**

---

## Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Building Vector Store from Scratch](#building-vector-store-from-scratch)
4. [Adding Single Documents](#adding-single-documents)
5. [Updating Existing Documents](#updating-existing-documents)
6. [Uploading to Hugging Face](#uploading-to-hugging-face)
7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
10. [Integration with VedaMD](#integration-with-vedamd)

---

## Overview

### What is the Pipeline?

The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.

**Before Pipeline** (Manual Process):

```
PDF → Extract Text → Chunk  → Embed  → Build FAISS → Upload to HF
 ↓        ↓            ↓        ↓          ↓             ↓
Hours   Manual       Script   Script   External      Manual
Work    Needed       Needed            Tool          Upload
```

**With Pipeline** (Automated):

```
PDF → python add_document.py file.pdf → Done ✅
 ↓
Minutes
```

### Pipeline Components

1. **build_vector_store.py** - Builds a complete vector store from a directory of PDFs
2. **add_document.py** - Adds single documents to an existing vector store
3. **Automatic Features**:
   - PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
   - Smart medical chunking
   - Duplicate detection
   - Quality validation
   - HF Hub integration
   - Automatic backups

---

## Quick Start

### Prerequisites

All required packages are already installed in your `.venv`:

- ✅ PyMuPDF (PDF extraction)
- ✅ pdfplumber (backup PDF extraction)
- ✅ sentence-transformers (embeddings)
- ✅ faiss-cpu (vector indexing)
- ✅ huggingface-hub (uploading)

### 30-Second Test

```bash
# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate

# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store

# That's it! ✅
```

---

## Building Vector Store from Scratch

### Basic Usage

Build a vector store from all PDFs in a directory:

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store
```

**Expected output:**

```
🚀 STARTING VECTOR STORE BUILD
============================================================
🔍 Scanning for PDFs in Obs
✅ Found 15 PDF files
   📄 Breech.pdf
   📄 RhESUS.pdf
   ... (13 more)
============================================================
📄 Processing: Breech.pdf
============================================================
📄 Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
📝 Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
🧮 Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added
... (processes all PDFs)
============================================================
✅ BUILD COMPLETE!
============================================================
📊 Summary:
   • PDFs processed: 15
   • Total chunks: 247
   • Embedding dimension: 384
   • Output directory: ./data/vector_store
   • Build time: 45.23 seconds
============================================================
```

### Customizing Chunk Size

For longer or shorter chunks:

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --chunk-size 1500 \
    --chunk-overlap 150
```

**Recommendations:**

- **chunk-size**: 800-1200 (default: 1000)
- **chunk-overlap**: 50-200 (default: 100)
- Smaller chunks = more precise retrieval
- Larger chunks = better context

### Using a Different Embedding Model

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --embedding-model "sentence-transformers/all-mpnet-base-v2"
```

**Available models:**

- `all-MiniLM-L6-v2` (default) - Fast, 384d, good quality
- `all-mpnet-base-v2` - Better quality, 768d, slower
- `multi-qa-mpnet-base-dot-v1` - Optimized for Q&A

### Build and Upload to HF

```bash
python \
    scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

**Note**: Requires the `HF_TOKEN` environment variable or the `--hf-token` argument.

---

## Adding Single Documents

### Basic Usage

Add a new guideline to an existing vector store:

```bash
python scripts/add_document.py \
    --file ./new_guideline.pdf \
    --citation "SLCOG Hypertension Guidelines 2025" \
    --category "Obstetrics" \
    --vector-store-dir ./data/vector_store
```

**Expected output:**

```
============================================================
📄 Adding document: new_guideline.pdf
============================================================
📄 Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
🔑 File hash: a3f2c9d8e1b0...
🔍 Checking for duplicates...
✅ No duplicates found
📝 Created 14 chunks
🧮 Generating embeddings...
📊 Adding to FAISS index...
✅ Added 14 chunks to vector store
📊 New total: 261 vectors
============================================================
💾 Saving updated vector store...
============================================================
📦 Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config
============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!
============================================================
📊 Summary:
   • Chunks added: 14
   • Total vectors: 261
   • Time taken: 8.43 seconds
============================================================
```

### Add and Upload to HF

```bash
python scripts/add_document.py \
    --file ./new_guideline.pdf \
    --citation "WHO Guidelines 2025" \
    --vector-store-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

### Allow Duplicates

By default, duplicate detection is enabled.
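As the expected output above shows ("🔑 File hash: …"), the duplicate check is keyed on a hash of the PDF's raw bytes, so renamed copies are still caught. The following is only a sketch of how such a check can work — `file_sha256`, `is_duplicate`, `known_hashes`, and the choice of SHA-256 are illustrative assumptions, not the exact `add_document.py` internals:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Hash the file's raw bytes in blocks, so large PDFs don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def is_duplicate(path: str, known_hashes: set) -> bool:
    """True if this exact file content was already added to the store."""
    return file_sha256(path) in known_hashes
```

Because the hash covers content rather than the filename, a genuinely updated guideline (different bytes) passes the check, while a re-upload of the same PDF under a new name does not.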
To force an add:

```bash
python scripts/add_document.py \
    --file ./updated_guideline.pdf \
    --vector-store-dir ./data/vector_store \
    --no-duplicate-check
```

---

## Updating Existing Documents

To update an existing guideline:

1. **Add new version** (recommended):

   ```bash
   python scripts/add_document.py \
       --file ./guidelines_v2.pdf \
       --citation "SLCOG Hypertension Guidelines 2025 v2" \
       --vector-store-dir ./data/vector_store
   ```

2. **Rebuild from scratch** (if major changes):

   ```bash
   # Move old PDFs to archive
   mkdir -p Obs/archive
   mv Obs/old_guideline.pdf Obs/archive/

   # Add new version
   cp ~/Downloads/new_guideline.pdf Obs/

   # Rebuild
   python scripts/build_vector_store.py \
       --input-dir ./Obs \
       --output-dir ./data/vector_store
   ```

---

## Uploading to Hugging Face

### Setup HF Token

```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"

# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...
```

### Initial Upload

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

### Incremental Upload

After adding a document:

```bash
python scripts/add_document.py \
    --file ./new.pdf \
    --vector-store-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

### What Gets Uploaded

- ✅ `faiss_index.bin` - FAISS vector index
- ✅ `documents.json` - Document chunks
- ✅ `metadata.json` - Citations, sources, sections
- ✅ `config.json` - Configuration settings
- ✅ `build_log.json` - Build information

---

## Advanced Usage

### Batch Processing Multiple Files

```bash
# Add multiple files in a loop
for pdf in new_guidelines/*.pdf; do
    python scripts/add_document.py \
        --file "$pdf" \
        --citation "$(basename "$pdf" .pdf)" \
        --vector-store-dir ./data/vector_store
done

# Then upload once
python scripts/add_document.py \
    --file dummy.pdf \
    --vector-store-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store \
    --no-duplicate-check
```

### Inspecting the Vector Store

```bash
# View config
cat data/vector_store/config.json

# View build log
cat data/vector_store/build_log.json | python -m json.tool

# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"

# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
```

### Backup Management

Backups are created automatically in `data/vector_store/backups/`:

```bash
# List backups
ls -lh data/vector_store/backups/

# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/
```

### Quality Checks

Check extraction quality for a specific PDF:

```python
from scripts.build_vector_store import PDFExtractor

text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")
```

---

## Troubleshooting

### Issue: "No PDF files found"

**Solution:**

```bash
# Check that the directory exists
ls -la ./Obs

# Use an absolute path
python scripts/build_vector_store.py \
    --input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
    --output-dir ./data/vector_store
```

### Issue: "Extracted text too short"

**Causes:**

- Scanned PDF (image-based)
- Encrypted PDF
- Corrupted PDF

**Solution:**

```bash
# Check the PDF manually
open Obs/problematic.pdf

# Try with OCR (requires tesseract)
pip install pytesseract
# The script will automatically fall back to OCR
```

### Issue: "Embedding dimension mismatch"

**Solution:**

```bash
# Check the existing config
cat data/vector_store/config.json

# Rebuild with the same model
python scripts/build_vector_store.py \
    --embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
    --input-dir ./Obs \
    --output-dir ./data/vector_store
```

### Issue: "Upload failed"

**Solution:**

```bash
# Check HF token
echo $HF_TOKEN

# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"

# Create repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"
```

### Issue: "Out of memory"

**Solution:**

```bash
# Reduce batch size in script (edit build_vector_store.py)
# Line ~338: change batch_size=32 to batch_size=8

# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into batches
python scripts/build_vector_store.py --input-dir Obs/batch1 ...
python scripts/add_document.py --file Obs/batch2/*.pdf ...
```

### Issue: "Duplicate detected but I want to update"

**Solution:**

```bash
# Option 1: Force add (creates a duplicate)
python scripts/add_document.py \
    --file ./updated.pdf \
    --no-duplicate-check \
    --vector-store-dir ./data/vector_store

# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store
```

---

## Best Practices

### 1. Organize Your PDFs

```
Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...
```

### 2. Use Meaningful Citations

```bash
# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"

# Bad
--citation "guideline.pdf"
```

### 3. Regular Backups

```bash
# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)
```

### 4. Test Before Uploading

```bash
# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vs

# Test with the RAG system

# Then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload
```

### 5. Version Control

Add to `.gitignore`:

```
data/vector_store/
test_vector_store/
*.log
backups/
```

Keep in Git:

```
scripts/
Obs/
requirements.txt
```

---

## Integration with VedaMD

### Using Your Vector Store

After building, update your RAG system:

```python
# In enhanced_groq_medical_rag.py, or wherever the vector store is loaded

# Option 1: Load from a local directory
vector_store = SimpleVectorStore("./data/vector_store")

# Option 2: Load from the HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
```

### Automatic Reloading

For production, reload the vector store periodically:

```python
import schedule
import time

def reload_vector_store():
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")

# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)

while True:
    schedule.run_pending()
    time.sleep(60)
```

---

## Next Steps

1. **Build your initial vector store:**

   ```bash
   python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
   ```

2. **Upload to HF:**

   ```bash
   python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
   ```

3. **Test with the RAG system:**

   ```bash
   python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
   ```

4. **Add new documents as they arrive:**

   ```bash
   python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
   ```

---

**Questions or Issues?**

Check the logs:

- `vector_store_build.log` - Build process
- `add_document.log` - Document additions

Or review the scripts:

- [scripts/build_vector_store.py](scripts/build_vector_store.py)
- [scripts/add_document.py](scripts/add_document.py)

---

**Last Updated**: October 23, 2025
**Version**: 1.0.0
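
---

**Appendix: Counting Chunks per Source**

The one-liners in "Inspecting Vector Store" can be extended into a small stdlib-only helper that shows how evenly your guidelines are represented in the store. This is a sketch, not part of the shipped scripts; it assumes each entry in `metadata.json` carries the `source` key used by the "List sources" one-liner, so adjust if your metadata schema differs:

```python
import json
from collections import Counter

def chunks_per_source(metadata_path: str) -> Counter:
    """Count how many chunks each source PDF contributed to the store."""
    with open(metadata_path) as f:
        meta = json.load(f)
    return Counter(m["source"] for m in meta)

# Example, once a store exists:
# for source, n in chunks_per_source("data/vector_store/metadata.json").most_common():
#     print(f"{n:4d}  {source}")
```

A source contributing only one or two chunks is worth a second look: it may be a short guideline, or a PDF whose extraction silently produced too little text.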