# VedaMD Document Pipeline Guide

**Complete guide for adding and managing medical documents in VedaMD**

---

## Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Building Vector Store from Scratch](#building-vector-store-from-scratch)
4. [Adding Single Documents](#adding-single-documents)
5. [Updating Existing Documents](#updating-existing-documents)
6. [Uploading to Hugging Face](#uploading-to-hugging-face)
7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
10. [Integration with VedaMD](#integration-with-vedamd)

---

## Overview

### What is the Pipeline?

The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.

**Before Pipeline** (Manual Process):

```
PDF → Extract Text → Chunk  → Embed  → Build FAISS → Upload to HF
 ↓        ↓            ↓        ↓          ↓             ↓
Hours   Manual       Script   Script   External      Manual
Work    Needed       Needed            Tool          Upload
```

**With Pipeline** (Automated):

```
PDF → python add_document.py file.pdf → Done ✅
 ↓
Minutes
```

### Pipeline Components

1. **build_vector_store.py** - Builds a complete vector store from a directory of PDFs
2. **add_document.py** - Adds single documents to an existing vector store
3. **Automatic Features**:
   - PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
   - Smart medical chunking
   - Duplicate detection
   - Quality validation
   - HF Hub integration
   - Automatic backups

---

## Quick Start

### Prerequisites

All required packages are already installed in your `.venv`:

- ✅ PyMuPDF (PDF extraction)
- ✅ pdfplumber (backup PDF extraction)
- ✅ sentence-transformers (embeddings)
- ✅ faiss-cpu (vector indexing)
- ✅ huggingface-hub (uploading)

### 30-Second Test

```bash
# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate

# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store

# That's it! ✅
```

---

## Building Vector Store from Scratch

### Basic Usage

Build a vector store from all PDFs in a directory:

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store
```

**Expected output:**

```
🚀 STARTING VECTOR STORE BUILD
============================================================
🔍 Scanning for PDFs in Obs
✅ Found 15 PDF files
   📄 Breech.pdf
   📄 RhESUS.pdf
   ... (13 more)
============================================================
📄 Processing: Breech.pdf
============================================================
📄 Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
📝 Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
🧮 Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added
... (processes all PDFs)
============================================================
✅ BUILD COMPLETE!
============================================================
📊 Summary:
   • PDFs processed: 15
   • Total chunks: 247
   • Embedding dimension: 384
   • Output directory: ./data/vector_store
   • Build time: 45.23 seconds
============================================================
```

### Customizing Chunk Size

For longer or shorter chunks:

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --chunk-size 1500 \
    --chunk-overlap 150
```

**Recommendations:**

- **chunk-size**: 800-1200 (default: 1000)
- **chunk-overlap**: 50-200 (default: 100)
- Smaller chunks = more precise retrieval
- Larger chunks = better context

### Using a Different Embedding Model

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --embedding-model "sentence-transformers/all-mpnet-base-v2"
```

**Available models:**

- `all-MiniLM-L6-v2` (default) - Fast, 384d, good quality
- `all-mpnet-base-v2` - Better quality, 768d, slower
- `multi-qa-mpnet-base-dot-v1` - Optimized for Q&A

### Build and Upload to HF

```bash
python \
    scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

**Note**: Requires the `HF_TOKEN` environment variable or the `--hf-token` argument.

---

## Adding Single Documents

### Basic Usage

Add a new guideline to an existing vector store:

```bash
python scripts/add_document.py \
    --file ./new_guideline.pdf \
    --citation "SLCOG Hypertension Guidelines 2025" \
    --category "Obstetrics" \
    --vector-store-dir ./data/vector_store
```

**Expected output:**

```
============================================================
📄 Adding document: new_guideline.pdf
============================================================
📄 Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
🔑 File hash: a3f2c9d8e1b0...
🔍 Checking for duplicates...
✅ No duplicates found
📝 Created 14 chunks
🧮 Generating embeddings...
📊 Adding to FAISS index...
✅ Added 14 chunks to vector store
📊 New total: 261 vectors
============================================================
💾 Saving updated vector store...
============================================================
📦 Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config
============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!
============================================================
📊 Summary:
   • Chunks added: 14
   • Total vectors: 261
   • Time taken: 8.43 seconds
============================================================
```

### Add and Upload to HF

```bash
python scripts/add_document.py \
    --file ./new_guideline.pdf \
    --citation "WHO Guidelines 2025" \
    --vector-store-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

### Allow Duplicates

By default, duplicate detection is enabled.
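As the expected output above shows ("🔑 File hash: …"), the duplicate check is keyed on a hash of the PDF's raw bytes, so renamed copies are still caught. The following is only a sketch of how such a check can work — `file_sha256`, `is_duplicate`, `known_hashes`, and the choice of SHA-256 are illustrative assumptions, not the exact `add_document.py` internals:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Hash the file's raw bytes in blocks, so large PDFs don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def is_duplicate(path: str, known_hashes: set) -> bool:
    """True if this exact file content was already added to the store."""
    return file_sha256(path) in known_hashes
```

Because the hash covers content rather than the filename, a genuinely updated guideline (different bytes) passes the check, while a re-upload of the same PDF under a new name does not.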
To force an add:

```bash
python scripts/add_document.py \
    --file ./updated_guideline.pdf \
    --vector-store-dir ./data/vector_store \
    --no-duplicate-check
```

---

## Updating Existing Documents

To update an existing guideline:

1. **Add new version** (recommended):

   ```bash
   python scripts/add_document.py \
       --file ./guidelines_v2.pdf \
       --citation "SLCOG Hypertension Guidelines 2025 v2" \
       --vector-store-dir ./data/vector_store
   ```

2. **Rebuild from scratch** (if major changes):

   ```bash
   # Move old PDFs to archive
   mkdir -p Obs/archive
   mv Obs/old_guideline.pdf Obs/archive/

   # Add new version
   cp ~/Downloads/new_guideline.pdf Obs/

   # Rebuild
   python scripts/build_vector_store.py \
       --input-dir ./Obs \
       --output-dir ./data/vector_store
   ```

---

## Uploading to Hugging Face

### Setup HF Token

```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"

# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...
```

### Initial Upload

```bash
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

### Incremental Upload

After adding a document:

```bash
python scripts/add_document.py \
    --file ./new.pdf \
    --vector-store-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store
```

### What Gets Uploaded

- ✅ `faiss_index.bin` - FAISS vector index
- ✅ `documents.json` - Document chunks
- ✅ `metadata.json` - Citations, sources, sections
- ✅ `config.json` - Configuration settings
- ✅ `build_log.json` - Build information

---

## Advanced Usage

### Batch Processing Multiple Files

```bash
# Add multiple files in a loop
for pdf in new_guidelines/*.pdf; do
    python scripts/add_document.py \
        --file "$pdf" \
        --citation "$(basename "$pdf" .pdf)" \
        --vector-store-dir ./data/vector_store
done

# Then upload once
python scripts/add_document.py \
    --file dummy.pdf \
    --vector-store-dir ./data/vector_store \
    --upload \
    --repo-id sniro23/VedaMD-Vector-Store \
    --no-duplicate-check
```

### Inspecting the Vector Store

```bash
# View config
cat data/vector_store/config.json

# View build log
cat data/vector_store/build_log.json | python -m json.tool

# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"

# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
```

### Backup Management

Backups are created automatically in `data/vector_store/backups/`:

```bash
# List backups
ls -lh data/vector_store/backups/

# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/
```

### Quality Checks

Check extraction quality for a specific PDF:

```python
from scripts.build_vector_store import PDFExtractor

text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")
```

---

## Troubleshooting

### Issue: "No PDF files found"

**Solution:**

```bash
# Check that the directory exists
ls -la ./Obs

# Use an absolute path
python scripts/build_vector_store.py \
    --input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
    --output-dir ./data/vector_store
```

### Issue: "Extracted text too short"

**Causes:**

- Scanned PDF (image-based)
- Encrypted PDF
- Corrupted PDF

**Solution:**

```bash
# Check the PDF manually
open Obs/problematic.pdf

# Try with OCR (requires tesseract)
pip install pytesseract
# The script will automatically fall back to OCR
```

### Issue: "Embedding dimension mismatch"

**Solution:**

```bash
# Check the existing config
cat data/vector_store/config.json

# Rebuild with the same model
python scripts/build_vector_store.py \
    --embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
    --input-dir ./Obs \
    --output-dir ./data/vector_store
```

### Issue: "Upload failed"

**Solution:**

```bash
# Check HF token
echo $HF_TOKEN

# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"

# Create repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"
```

### Issue: "Out of memory"

**Solution:**

```bash
# Reduce batch size in script (edit build_vector_store.py)
# Line ~338: change batch_size=32 to batch_size=8

# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into batches
python scripts/build_vector_store.py --input-dir Obs/batch1 ...
python scripts/add_document.py --file Obs/batch2/*.pdf ...
```

### Issue: "Duplicate detected but I want to update"

**Solution:**

```bash
# Option 1: Force add (creates a duplicate)
python scripts/add_document.py \
    --file ./updated.pdf \
    --no-duplicate-check \
    --vector-store-dir ./data/vector_store

# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
    --input-dir ./Obs \
    --output-dir ./data/vector_store
```

---

## Best Practices

### 1. Organize Your PDFs

```
Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...
```

### 2. Use Meaningful Citations

```bash
# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"

# Bad
--citation "guideline.pdf"
```

### 3. Regular Backups

```bash
# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)
```

### 4. Test Before Uploading

```bash
# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vs

# Test with the RAG system

# Then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload
```

### 5. Version Control

Add to `.gitignore`:

```
data/vector_store/
test_vector_store/
*.log
backups/
```

Keep in Git:

```
scripts/
Obs/
requirements.txt
```

---

## Integration with VedaMD

### Using Your Vector Store

After building, update your RAG system:

```python
# In enhanced_groq_medical_rag.py, or wherever the vector store is loaded

# Option 1: Load from a local directory
vector_store = SimpleVectorStore("./data/vector_store")

# Option 2: Load from the HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
```

### Automatic Reloading

For production, reload the vector store periodically:

```python
import schedule
import time

def reload_vector_store():
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")

# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)

while True:
    schedule.run_pending()
    time.sleep(60)
```

---

## Next Steps

1. **Build your initial vector store:**

   ```bash
   python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
   ```

2. **Upload to HF:**

   ```bash
   python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
   ```

3. **Test with the RAG system:**

   ```bash
   python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
   ```

4. **Add new documents as they arrive:**

   ```bash
   python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
   ```

---

**Questions or Issues?**

Check the logs:

- `vector_store_build.log` - Build process
- `add_document.log` - Document additions

Or review the scripts:

- [scripts/build_vector_store.py](scripts/build_vector_store.py)
- [scripts/add_document.py](scripts/add_document.py)

---

**Last Updated**: October 23, 2025
**Version**: 1.0.0
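
---

**Appendix: Counting Chunks per Source**

The one-liners in "Inspecting Vector Store" can be extended into a small stdlib-only helper that shows how evenly your guidelines are represented in the store. This is a sketch, not part of the shipped scripts; it assumes each entry in `metadata.json` carries the `source` key used by the "List sources" one-liner, so adjust if your metadata schema differs:

```python
import json
from collections import Counter

def chunks_per_source(metadata_path: str) -> Counter:
    """Count how many chunks each source PDF contributed to the store."""
    with open(metadata_path) as f:
        meta = json.load(f)
    return Counter(m["source"] for m in meta)

# Example, once a store exists:
# for source, n in chunks_per_source("data/vector_store/metadata.json").most_common():
#     print(f"{n:4d}  {source}")
```

A source contributing only one or two chunks is worth a second look: it may be a short guideline, or a PDF whose extraction silently produced too little text.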