# VedaMD Project Structure

**Clean, organized codebase for production deployment**

Last updated: October 23, 2025

---

## Directory Structure

```
SL Clinical Assistant/
├── app.py                          # Gradio interface (HF Spaces entry point)
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment variable template
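├── .gitignore                      # Git ignore rules
├── run_backend.sh                  # Start FastAPI backend (port 7862)
├── run_frontend.sh                 # Start Next.js frontend (port 3000)
├── kill_backend.sh                 # Stop backend processes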
│
├── src/                            # Core application code
│   ├── __init__.py
│   ├── enhanced_groq_medical_rag.py       # Main RAG system (Cerebras-powered)
│   ├── enhanced_backend_api.py            # FastAPI backend for frontend
│   ├── simple_vector_store.py             # Vector store loader
│   ├── vector_store_compatibility.py      # Compatibility wrapper (temporary)
│   ├── enhanced_medical_context.py        # Medical context enhancement
│   └── medical_response_verifier.py       # Response verification & safety
│
├── scripts/                        # Automation scripts
│   ├── build_vector_store.py      # Build complete vector store from PDFs
│   └── add_document.py             # Add single document incrementally
│
├── frontend/                       # Next.js frontend (separate deployment)
│   ├── src/
│   │   ├── app/
│   │   ├── components/
│   │   └── lib/
│   │       └── api.ts              # API client (FastAPI + Gradio support)
│   ├── public/
│   ├── package.json
│   └── .env.local.example
│
├── data/                           # Data files (local only, not in git)
│   ├── guidelines/                 # Source PDF files (moved from Obs/)
│   ├── vector_store/               # Built vector store (FAISS + metadata)
│   │   ├── faiss_index.bin
│   │   ├── documents.json
│   │   ├── metadata.json
│   │   ├── config.json
│   │   ├── build_log.json
│   │   └── backups/                # Automatic backups
│   └── processed/                  # Processed documents (optional)
│
├── docs/                           # Documentation index
│   └── README.md                   # Documentation directory index
│
├── archive/                        # Old/deprecated files (not in git)
│   ├── old_scripts/                # batch_ocr_pipeline.py, convert_pdf.py
│   └── old_docs/                   # output.md, cleanup_plan.md, etc.
│
├── test_pdfs/                      # Test files (not in git)
├── test_vector_store/              # Test vector store (not in git)
│
└── Documentation Files             # Root-level docs
    ├── README.md                   # Main project README
    ├── PIPELINE_GUIDE.md           # Document pipeline usage guide
    ├── LOCAL_TESTING_GUIDE.md      # Local development guide
    ├── IMPROVEMENT_PLAN.md         # Project roadmap
    ├── DEPLOYMENT.md               # Deployment instructions
    ├── SECURITY_SETUP.md           # Security configuration
    ├── CEREBRAS_MIGRATION_GUIDE.md # Cerebras migration details
    ├── QUICK_START_CEREBRAS.md     # Cerebras quickstart
    ├── PRODUCTION_READINESS_REPORT.md  # Production assessment
    ├── CHANGES_SUMMARY.md          # Summary of changes
    └── CEREBRAS_SUMMARY.md         # Cerebras integration summary
```

---

## Core Files

### Application Entry Points

| File | Purpose | Deployment |
|------|---------|------------|
| `app.py` | Gradio interface | Hugging Face Spaces |
| `src/enhanced_backend_api.py` | FastAPI REST API | Hugging Face Spaces (port 7862) |
| `frontend/` | Next.js frontend | Netlify / Vercel |

### RAG System

| File | Purpose | Key Features |
|------|---------|--------------|
| `src/enhanced_groq_medical_rag.py` | Main RAG orchestrator | Cerebras integration, multi-stage retrieval, medical safety |
| `src/simple_vector_store.py` | Vector store loader | HF Hub download, FAISS search |
| `src/enhanced_medical_context.py` | Medical context enhancement | Entity extraction, relevance scoring |
| `src/medical_response_verifier.py` | Response verification | Claim validation, source traceability |

### Automation Scripts

| Script | Purpose | Usage |
|--------|---------|-------|
| `scripts/build_vector_store.py` | Build complete vector store | `python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store --upload` |
| `scripts/add_document.py` | Add single document | `python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store --upload` |

### Startup Scripts

| Script | Purpose |
|--------|---------|
| `run_backend.sh` | Start FastAPI backend (port 7862) |
| `run_frontend.sh` | Start Next.js frontend (port 3000) |
| `kill_backend.sh` | Stop backend processes |

---

## Data Files

### Vector Store Files (`data/vector_store/`)

Generated by `scripts/build_vector_store.py`:

| File | Purpose | Format |
|------|---------|--------|
| `faiss_index.bin` | FAISS vector index | Binary |
| `documents.json` | Document chunks | JSON array of strings |
| `metadata.json` | Document metadata | JSON array of objects |
| `config.json` | Build configuration | JSON object |
| `build_log.json` | Build information | JSON object |

**Metadata Structure** (one entry of the `metadata.json` array):
```json
{
  "source": "guideline.pdf",
  "section": "Management",
  "chunk_id": 0,
  "chunk_size": 1000,
  "file_hash": "a3f2c9d8...",
  "extraction_method": "pymupdf",
  "total_pages": 15,
  "citation": "SLCOG Guidelines 2025",
  "category": "Obstetrics",
  "processed_at": "2025-10-23T15:08:30.273544"
}
```
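
The table above describes `documents.json` as an array of strings and `metadata.json` as an array of objects. A minimal sketch of loading both and checking that they line up is shown below. It assumes that entry `i` of `metadata.json` describes chunk `i` of `documents.json`, which the source does not state explicitly; the function name `load_chunks` and the choice of required fields are hypothetical.

```python
import json
from pathlib import Path

# Fields taken from the metadata example; treating them as mandatory is an assumption.
REQUIRED_FIELDS = {"source", "chunk_id", "citation"}


def load_chunks(store_dir):
    """Load documents.json and metadata.json and pair them by position.

    Assumes (not confirmed by the project docs) that the two arrays are
    parallel: metadata[i] describes documents[i].
    """
    store = Path(store_dir)
    documents = json.loads((store / "documents.json").read_text(encoding="utf-8"))
    metadata = json.loads((store / "metadata.json").read_text(encoding="utf-8"))

    # A length mismatch means the store is inconsistent and should be rebuilt.
    if len(documents) != len(metadata):
        raise ValueError(
            f"{len(documents)} chunks but {len(metadata)} metadata entries"
        )

    # Report the first entry that lacks one of the expected fields.
    for i, entry in enumerate(metadata):
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"metadata[{i}] is missing {sorted(missing)}")

    return list(zip(documents, metadata))
```

Calling `load_chunks("data/vector_store")` after a build returns `(chunk_text, metadata)` pairs, or raises `ValueError` if the two files have drifted apart.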

---

## Configuration Files

### Environment Variables

**.env** (local development):
```bash
CEREBRAS_API_KEY=csk_your_key_here
HF_TOKEN=hf_your_token_here  # For uploading vector store
```

**Hugging Face Spaces Secrets:**
```
CEREBRAS_API_KEY  # Required
HF_TOKEN          # Optional (for vector store upload)
ALLOWED_ORIGINS   # Optional (CORS, comma-separated)
```
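
As an illustration of how these variables might be consumed, the sketch below reads them from the environment, fails fast when the required key is absent, and splits `ALLOWED_ORIGINS` on commas. The function name `load_settings` and the returned dictionary keys are hypothetical; how the project actually reads its configuration is not shown here.

```python
import os


def load_settings(env=None):
    """Read the variables listed above from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # CEREBRAS_API_KEY is marked as required, so fail early if it is missing.
    api_key = env.get("CEREBRAS_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("CEREBRAS_API_KEY is required but not set")

    # ALLOWED_ORIGINS is comma-separated; drop blanks and surrounding spaces.
    raw_origins = env.get("ALLOWED_ORIGINS", "")
    origins = [item.strip() for item in raw_origins.split(",") if item.strip()]

    return {
        "cerebras_api_key": api_key,
        "hf_token": env.get("HF_TOKEN") or None,  # optional
        "allowed_origins": origins,
    }
```

The resulting `allowed_origins` list has the shape that FastAPI's `CORSMiddleware` accepts for its `allow_origins` argument.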

### Requirements

**requirements.txt** - Python dependencies:
- cerebras-cloud-sdk - Cerebras API client
- gradio - Web interface
- fastapi - REST API
- sentence-transformers - Embeddings
- faiss-cpu - Vector search
- huggingface-hub - Model/data hosting
- PyMuPDF, pdfplumber - PDF extraction

---

## Git Ignore Strategy

### Ignored (Local Only)

- `data/guidelines/` - Source PDFs
- `data/vector_store/` - Built vector store
- `archive/` - Old files
- `test_pdfs/`, `test_vector_store/` - Test files
- `frontend/` - Separate deployment
- `.env` - Local environment variables
- `*.log` - Log files

### Committed (Version Control)

- `src/` - Application code
- `scripts/` - Automation scripts
- `app.py` - Gradio entry point
- `requirements.txt` - Dependencies
- `.env.example` - Environment template
- `*.md` - Documentation

---

## Workflow

### Development Workflow

1. **Add new guideline:**
   ```bash
   cp ~/Downloads/new_guideline.pdf data/guidelines/
   ```

2. **Update vector store:**
   ```bash
   python scripts/add_document.py \
     --file data/guidelines/new_guideline.pdf \
     --citation "SLCOG Guidelines 2025" \
     --vector-store-dir ./data/vector_store
   ```

3. **Test locally:**
   ```bash
   # Terminal 1: Start backend
   ./run_backend.sh

   # Terminal 2: Start frontend
   ./run_frontend.sh

   # Or just test Gradio
   python app.py
   ```

4. **Deploy to production:**
   ```bash
   # Upload vector store to HF Hub
   python scripts/build_vector_store.py \
     --input-dir ./data/guidelines \
     --output-dir ./data/vector_store \
     --upload --repo-id sniro23/VedaMD-Vector-Store

   # Push code to HF Spaces
   git add src/ app.py requirements.txt
   git commit -m "Update: Add new guidelines"
   git push origin main
   ```
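
Before running step 2, it can be useful to check whether a PDF is already in the vector store. The sketch below compares a file's hash against the `file_hash` values in `metadata.json`. The hash algorithm used by the project's scripts is not documented here, so SHA-256 is an assumption, and both function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, block_size=1 << 20):
    """Hash a file in blocks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_already_indexed(pdf_path, store_dir):
    """Return True if metadata.json already holds an entry with this file's hash.

    Assumes file_hash stores a full SHA-256 hex digest (unconfirmed).
    """
    metadata_path = Path(store_dir) / "metadata.json"
    if not metadata_path.exists():
        return False  # no store yet, so nothing can be indexed
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    known_hashes = {entry.get("file_hash") for entry in metadata}
    return sha256_of_file(pdf_path) in known_hashes
```

If the scripts use a different algorithm or a truncated digest, only `sha256_of_file` needs to change.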

### Production Deployment

**Backend (Hugging Face Spaces):**
- Gradio interface: Automatic from `app.py`
- FastAPI API: Runs on port 7862
- Vector store: Downloaded from HF Hub on startup
- Secrets: Set in HF Spaces settings
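
The startup download can be sketched as follows: check for the expected files locally and fetch the repository from the Hub only when something is missing. This is a hedged illustration, not the project's loader. The function names are hypothetical, and `repo_type="dataset"` is an assumption about how the vector store repository is hosted.

```python
from pathlib import Path

# File names from the "Vector Store Files" table in this document.
REQUIRED_FILES = ("faiss_index.bin", "documents.json", "metadata.json", "config.json")


def missing_files(store_dir):
    """List the required vector store files that are absent from store_dir."""
    store = Path(store_dir)
    return [name for name in REQUIRED_FILES if not (store / name).is_file()]


def ensure_vector_store(store_dir, repo_id, repo_type="dataset", token=None):
    """Download the store from the Hub only when local files are missing.

    repo_type="dataset" is an assumption; use "model" if the repo is a model repo.
    """
    if not missing_files(store_dir):
        return Path(store_dir)  # everything is already present, skip the download

    # Imported lazily so the local check works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,
        local_dir=str(store_dir),
        token=token,
    )

    still_missing = missing_files(store_dir)
    if still_missing:
        raise FileNotFoundError(f"Vector store incomplete, missing: {still_missing}")
    return Path(store_dir)
```

A call such as `ensure_vector_store("data/vector_store", "sniro23/VedaMD-Vector-Store")` would then run once at application startup, before the FAISS index is loaded.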

**Frontend (Netlify):**
- Build: `cd frontend && npm run build`
- Deploy: Automatic from GitHub
- Environment: `NEXT_PUBLIC_API_URL=https://sniro23-vedamd-enhanced.hf.space`

---

## Migration Notes

### From Old Structure

**Moved:**
- `Obs/*.pdf` → `data/guidelines/*.pdf`
- Vector store logic remains in `src/`


**Archived:**
- `batch_ocr_pipeline.py` → `archive/old_scripts/`
- `convert_pdf.py` → `archive/old_scripts/`
- `output*.md` → `archive/old_docs/`
- `cleanup_plan.md` → `archive/old_docs/`

**Created New:**
- `scripts/` - Automation scripts
- `data/` - Data directory structure
- `docs/` - Documentation index
- `archive/` - Old files

---

## Key Improvements

### Before Cleanup
```
SL Clinical Assistant/
├── app.py
├── src/
├── Obs/                    # Unclear name
├── batch_ocr_pipeline.py   # Old script at root
├── convert_pdf.py          # Old script at root
├── output.md               # Temporary file
├── output_new.md           # Temporary file
└── 15+ .md files at root   # Disorganized docs
```

### After Cleanup
```
SL Clinical Assistant/
├── app.py                  # Clear entry point
├── src/                    # Core code
├── scripts/                # Automation scripts
├── data/                   # Data files
│   ├── guidelines/         # Clear purpose
│   └── vector_store/       # Clear purpose
├── docs/                   # Documentation index
├── archive/                # Old files preserved
└── Documentation files     # Organized at root
```

---

## Best Practices

### Code Organization

1. **Core Logic**: Keep in `src/`
2. **Automation**: Keep in `scripts/`
3. **Data**: Keep in `data/` (gitignored)
4. **Tests**: Keep in `tests/` (if created)

### Documentation

1. **User Guides**: Root level (PIPELINE_GUIDE.md, etc.)
2. **Technical Docs**: Root level (DEPLOYMENT.md, etc.)
3. **Code Docs**: Inline docstrings in Python files
4. **Index**: `docs/README.md` for navigation

### Data Management

1. **Source Data**: `data/guidelines/`
2. **Processed Data**: `data/vector_store/`
3. **Backups**: Automatic in `data/vector_store/backups/`
4. **Test Data**: `test_pdfs/`, `test_vector_store/`
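
For illustration, a backup step of the kind listed above could copy the current store files into a timestamped folder under `backups/` before an update. The folder naming scheme and the function name `backup_vector_store` are assumptions; the layout the scripts actually use is not documented here.

```python
import shutil
import time
from pathlib import Path


def backup_vector_store(store_dir, label=None):
    """Copy the top-level files of store_dir into backups/<label>/.

    The timestamped label format is an assumption for this sketch.
    """
    store = Path(store_dir)
    label = label or time.strftime("%Y%m%d-%H%M%S")
    target = store / "backups" / label

    # exist_ok=False prevents overwriting an earlier backup with the same label.
    target.mkdir(parents=True, exist_ok=False)

    # Copy only regular files; the backups/ directory itself is skipped.
    for item in store.iterdir():
        if item.is_file():
            shutil.copy2(item, target / item.name)
    return target
```

Copying only top-level files keeps earlier backups from being nested inside newer ones.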

### Version Control

1. **Commit Code**: `src/`, `scripts/`, `app.py`
2. **Ignore Data**: `data/`, `archive/`, `test_*/`
3. **Commit Docs**: All `.md` files
4. **Templates**: `.env.example`, not `.env`

---

## Quick Reference

### Common Commands

```bash
# Build vector store from scratch
python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store

# Add single document
python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store

# Start backend
./run_backend.sh

# Start frontend
./run_frontend.sh

# Test Gradio interface
python app.py

# Upload to HF Hub
python scripts/build_vector_store.py ... --upload --repo-id sniro23/VedaMD-Vector-Store
```

### Important Paths

- **PDFs**: `data/guidelines/`
- **Vector Store**: `data/vector_store/`
- **RAG System**: `src/enhanced_groq_medical_rag.py`
- **API**: `src/enhanced_backend_api.py`
- **Scripts**: `scripts/`
- **Docs**: Root level + `docs/README.md`

---

**Clean codebase = Maintainable codebase = Production-ready codebase**