# VedaMD Project Structure

**Clean, organized codebase for production deployment**

Last updated: October 23, 2025

---

## Directory Structure

```
SL Clinical Assistant/
├── app.py                              # Gradio interface (HF Spaces entry point)
├── requirements.txt                    # Python dependencies
├── .env.example                        # Environment variable template
├── .gitignore                          # Git ignore rules
│
├── src/                                # Core application code
│   ├── __init__.py
│   ├── enhanced_groq_medical_rag.py    # Main RAG system (Cerebras-powered)
│   ├── enhanced_backend_api.py         # FastAPI backend for frontend
│   ├── simple_vector_store.py          # Vector store loader
│   ├── vector_store_compatibility.py   # Compatibility wrapper (temporary)
│   ├── enhanced_medical_context.py     # Medical context enhancement
│   └── medical_response_verifier.py    # Response verification & safety
│
├── scripts/                            # Automation scripts
│   ├── build_vector_store.py           # Build complete vector store from PDFs
│   └── add_document.py                 # Add single document incrementally
│
├── frontend/                           # Next.js frontend (separate deployment)
│   ├── src/
│   │   ├── app/
│   │   ├── components/
│   │   └── lib/
│   │       └── api.ts                  # API client (FastAPI + Gradio support)
│   ├── public/
│   ├── package.json
│   └── .env.local.example
│
├── data/                               # Data files (local only, not in git)
│   ├── guidelines/                     # Source PDF files (moved from Obs/)
│   ├── vector_store/                   # Built vector store (FAISS + metadata)
│   │   ├── faiss_index.bin
│   │   ├── documents.json
│   │   ├── metadata.json
│   │   ├── config.json
│   │   └── backups/                    # Automatic backups
│   └── processed/                      # Processed documents (optional)
│
├── docs/                               # Documentation index
│   └── README.md                       # Documentation directory index
│
├── archive/                            # Old/deprecated files (not in git)
│   ├── old_scripts/                    # batch_ocr_pipeline.py, convert_pdf.py
│   └── old_docs/                       # output.md, cleanup_plan.md, etc.
│
├── test_pdfs/                          # Test files (not in git)
├── test_vector_store/                  # Test vector store (not in git)
│
└── Documentation Files                 # Root-level docs
    ├── README.md                       # Main project README
    ├── PIPELINE_GUIDE.md               # Document pipeline usage guide
    ├── LOCAL_TESTING_GUIDE.md          # Local development guide
    ├── IMPROVEMENT_PLAN.md             # Project roadmap
    ├── DEPLOYMENT.md                   # Deployment instructions
    ├── SECURITY_SETUP.md               # Security configuration
    ├── CEREBRAS_MIGRATION_GUIDE.md     # Cerebras migration details
    ├── QUICK_START_CEREBRAS.md         # Cerebras quickstart
    ├── PRODUCTION_READINESS_REPORT.md  # Production assessment
    ├── CHANGES_SUMMARY.md              # Summary of changes
    └── CEREBRAS_SUMMARY.md             # Cerebras integration summary
```

---

## Core Files

### Application Entry Points

| File | Purpose | Deployment |
|------|---------|------------|
| `app.py` | Gradio interface | Hugging Face Spaces |
| `src/enhanced_backend_api.py` | FastAPI REST API | Hugging Face Spaces (port 7862) |
| `frontend/` | Next.js frontend | Netlify / Vercel |

### RAG System

| File | Purpose | Key Features |
|------|---------|--------------|
| `src/enhanced_groq_medical_rag.py` | Main RAG orchestrator | Cerebras integration, multi-stage retrieval, medical safety |
| `src/simple_vector_store.py` | Vector store loader | HF Hub download, FAISS search |
| `src/enhanced_medical_context.py` | Medical context enhancement | Entity extraction, relevance scoring |
| `src/medical_response_verifier.py` | Response verification | Claim validation, source traceability |

### Automation Scripts

| Script | Purpose | Usage |
|--------|---------|-------|
| `scripts/build_vector_store.py` | Build complete vector store | `python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store --upload` |
| `scripts/add_document.py` | Add single document | `python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store --upload` |

### Startup Scripts

| Script | Purpose |
|--------|---------|
| `run_backend.sh` | Start FastAPI backend (port 7862) |
| `run_frontend.sh` | Start Next.js frontend (port 3000) |
| `kill_backend.sh` | Stop backend processes |

---

## Data Files

### Vector Store Files (data/vector_store/)

Generated by `build_vector_store.py`:

| File | Purpose | Format |
|------|---------|--------|
| `faiss_index.bin` | FAISS vector index | Binary |
| `documents.json` | Document chunks | JSON array of strings |
| `metadata.json` | Document metadata | JSON array of objects |
| `config.json` | Build configuration | JSON object |
| `build_log.json` | Build information | JSON object |

**Metadata Structure:**

```json
{
  "source": "guideline.pdf",
  "section": "Management",
  "chunk_id": 0,
  "chunk_size": 1000,
  "file_hash": "a3f2c9d8...",
  "extraction_method": "pymupdf",
  "total_pages": 15,
  "citation": "SLCOG Guidelines 2025",
  "category": "Obstetrics",
  "processed_at": "2025-10-23T15:08:30.273544"
}
```

---

## Configuration Files

### Environment Variables

**.env** (local development):

```bash
CEREBRAS_API_KEY=csk_your_key_here
HF_TOKEN=hf_your_token_here  # For uploading vector store
```

**Hugging Face Spaces Secrets:**

```
CEREBRAS_API_KEY    # Required
HF_TOKEN            # Optional (for vector store upload)
ALLOWED_ORIGINS     # Optional (CORS, comma-separated)
```

### Requirements

**requirements.txt** - Python dependencies:

- cerebras-cloud-sdk - Cerebras API client
- gradio - Web interface
- fastapi - REST API
- sentence-transformers - Embeddings
- faiss-cpu - Vector search
- huggingface-hub - Model/data hosting
- PyMuPDF, pdfplumber - PDF extraction

---

## Git Ignore Strategy

### Ignored (Local Only)

- `data/guidelines/` - Source PDFs
- `data/vector_store/` - Built vector store
- `archive/` - Old files
- `test_pdfs/`, `test_vector_store/` - Test files
- `frontend/` - Separate deployment
- `.env` - Local environment variables
- `*.log` - Log files

### Committed (Version Control)

- `src/` - Application code
- `scripts/` - Automation scripts
- `app.py` - Gradio entry point
- `requirements.txt` - Dependencies
- `.env.example` - Environment template
- `*.md` - Documentation

---

## Workflow

### Development Workflow

1. **Add new guideline:**

   ```bash
   cp ~/Downloads/new_guideline.pdf data/guidelines/
   ```

2. **Update vector store:**

   ```bash
   python scripts/add_document.py \
     --file data/guidelines/new_guideline.pdf \
     --citation "SLCOG Guidelines 2025" \
     --vector-store-dir ./data/vector_store
   ```

3. **Test locally:**

   ```bash
   # Terminal 1: Start backend
   ./run_backend.sh

   # Terminal 2: Start frontend
   ./run_frontend.sh

   # Or just test Gradio
   python app.py
   ```

4. **Deploy to production:**

   ```bash
   # Upload vector store to HF Hub
   python scripts/build_vector_store.py \
     --input-dir ./data/guidelines \
     --output-dir ./data/vector_store \
     --upload --repo-id sniro23/VedaMD-Vector-Store

   # Push code to HF Spaces
   git add src/ app.py requirements.txt
   git commit -m "Update: Add new guidelines"
   git push origin main
   ```

### Production Deployment

**Backend (Hugging Face Spaces):**

- Gradio interface: Automatic from `app.py`
- FastAPI API: Runs on port 7862
- Vector store: Downloaded from HF Hub on startup
- Secrets: Set in HF Spaces settings

**Frontend (Netlify):**

- Build: `cd frontend && npm run build`
- Deploy: Automatic from GitHub
- Environment: `NEXT_PUBLIC_API_URL=https://sniro23-vedamd-enhanced.hf.space`

---

## Migration Notes

### From Old Structure

**Moved:**

- `Obs/*.pdf` → `data/guidelines/*.pdf`
- Vector store logic remains in `src/`

**Archived:**

- `batch_ocr_pipeline.py` → `archive/old_scripts/`
- `convert_pdf.py` → `archive/old_scripts/`
- `output*.md` → `archive/old_docs/`
- `cleanup_plan.md` → `archive/old_docs/`

**Created New:**

- `scripts/` - Automation scripts
- `data/` - Data directory structure
- `docs/` - Documentation index
- `archive/` - Old files

---

## Key Improvements

### Before Cleanup

```
SL Clinical Assistant/
├── app.py
├── src/
├── Obs/                      # Unclear name
├── batch_ocr_pipeline.py     # Old script at root
├── convert_pdf.py            # Old script at root
├── output.md                 # Temporary file
├── output_new.md             # Temporary file
└── 15+ .md files at root     # Disorganized docs
```

### After Cleanup

```
SL Clinical Assistant/
├── app.py                    # Clear entry point
├── src/                      # Core code
├── scripts/                  # Automation scripts
├── data/                     # Data files
│   ├── guidelines/           # Clear purpose
│   └── vector_store/         # Clear purpose
├── docs/                     # Documentation index
├── archive/                  # Old files preserved
└── Documentation files       # Organized at root
```

---

## Best Practices

### Code Organization

1. **Core Logic**: Keep in `src/`
2. **Automation**: Keep in `scripts/`
3. **Data**: Keep in `data/` (gitignored)
4. **Tests**: Keep in `tests/` (if created)

### Documentation

1. **User Guides**: Root level (PIPELINE_GUIDE.md, etc.)
2. **Technical Docs**: Root level (DEPLOYMENT.md, etc.)
3. **Code Docs**: Inline docstrings in Python files
4. **Index**: `docs/README.md` for navigation

### Data Management

1. **Source Data**: `data/guidelines/`
2. **Processed Data**: `data/vector_store/`
3. **Backups**: Automatic in `data/vector_store/backups/`
4. **Test Data**: `test_pdfs/`, `test_vector_store/`

### Version Control

1. **Commit Code**: `src/`, `scripts/`, `app.py`
2. **Ignore Data**: `data/`, `archive/`, `test_*/`
3. **Commit Docs**: All `.md` files
4. **Templates**: `.env.example`, not `.env`

---

## Quick Reference

### Common Commands

```bash
# Build vector store from scratch
python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store

# Add single document
python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store

# Start backend
./run_backend.sh

# Start frontend
./run_frontend.sh

# Test Gradio interface
python app.py

# Upload to HF Hub
python scripts/build_vector_store.py ... --upload --repo-id sniro23/VedaMD-Vector-Store
```

### Important Paths

- **PDFs**: `data/guidelines/`
- **Vector Store**: `data/vector_store/`
- **RAG System**: `src/enhanced_groq_medical_rag.py`
- **API**: `src/enhanced_backend_api.py`
- **Scripts**: `scripts/`
- **Docs**: Root level + `docs/README.md`

---

**Clean codebase = Maintainable codebase = Production-ready codebase**
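
Because `data/vector_store/` is gitignored and rebuilt locally, it can help to sanity-check its layout before uploading. The following is a minimal sketch using only the standard library, not part of the project's scripts. The function name `check_vector_store` is hypothetical, treating `build_log.json` as optional is an assumption, and so is the idea that `documents.json` and `metadata.json` align one-to-one and that every metadata record carries all keys shown in the Metadata Structure example.

```python
import json
from pathlib import Path

# File names from the "Vector Store Files" table.
# Assumption: build_log.json is optional, so it is not checked here.
REQUIRED_FILES = ("faiss_index.bin", "documents.json", "metadata.json", "config.json")

# Keys from the "Metadata Structure" example.
# Assumption: every record is expected to carry all of them.
METADATA_KEYS = {
    "source", "section", "chunk_id", "chunk_size", "file_hash",
    "extraction_method", "total_pages", "citation", "category", "processed_at",
}


def check_vector_store(store_dir):
    """Return a list of problems found; an empty list means the layout looks consistent."""
    store = Path(store_dir)

    # Stop early if any expected file is absent.
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (store / name).is_file()
    ]
    if problems:
        return problems

    documents = json.loads((store / "documents.json").read_text(encoding="utf-8"))
    metadata = json.loads((store / "metadata.json").read_text(encoding="utf-8"))

    # Assumption: one metadata object per document chunk.
    if len(documents) != len(metadata):
        problems.append(
            f"documents.json has {len(documents)} entries "
            f"but metadata.json has {len(metadata)}"
        )

    # Report records that are not objects or that lack documented keys.
    for i, record in enumerate(metadata):
        if not isinstance(record, dict):
            problems.append(f"metadata[{i}] is not a JSON object")
            continue
        missing = METADATA_KEYS - record.keys()
        if missing:
            problems.append(f"metadata[{i}] missing keys: {sorted(missing)}")

    return problems
```

Calling `check_vector_store("data/vector_store")` from the project root returns a list of human-readable problems, which could be printed before running the upload step. The check does not open the FAISS index itself, so it says nothing about whether the index matches the documents.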