
LinguaCustodia Project Rules & Guidelines

Version: 24.1.0
Last Updated: October 6, 2025
Status: βœ… Production Ready


πŸ”‘ GOLDEN RULES - NEVER CHANGE

1. Environment Variables (MANDATORY)

# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here    # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here      # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                           # Default model selection
DEPLOYMENT_ENV=huggingface                    # Platform configuration

Critical Rules:

  • βœ… HF_TOKEN_LC: For accessing private LinguaCustodia models
  • βœ… HF_TOKEN: For HuggingFace Pro account features (endpoints, Spaces, etc.)
  • βœ… Always load from .env: from dotenv import load_dotenv; load_dotenv()

2. Model Reloading (vLLM Limitation)

# vLLM does not support hot swaps - service restart required
# Solution: Implemented service restart mechanism via /load-model endpoint
# Process: Clear GPU memory β†’ Restart service β†’ Load new model

Critical Rules:

  • ❌ vLLM does not support hot swaps
  • βœ… We need to reload because vLLM does not support hot swaps
  • βœ… Service restart mechanism implemented for model switching

3. OpenAI Standard Interface

# We expose OpenAI standard interface
# Endpoints: /v1/chat/completions, /v1/completions, /v1/models
# Full compatibility for easy integration

Critical Rules:

  • βœ… We expose OpenAI standard interface
  • βœ… Full OpenAI API compatibility
  • βœ… Standard endpoints for easy integration

🚫 NEVER DO THESE

❌ Token Usage Mistakes

  1. NEVER use HF_TOKEN for LinguaCustodia model access (use HF_TOKEN_LC)
  2. NEVER use HF_TOKEN_LC for HuggingFace Pro features (use HF_TOKEN)
  3. NEVER hardcode tokens in code (always use environment variables)

❌ Model Loading Mistakes

  1. NEVER try to hot-swap models with vLLM (service restart required)
  2. NEVER use 12B+ models on L40 GPU (memory allocation fails)
  3. NEVER skip GPU memory cleanup during model switching

❌ Deployment Mistakes

  1. NEVER skip virtual environment activation
  2. NEVER use global Python installations
  3. NEVER forget to load environment variables from .env
  4. NEVER attempt local implementation or testing (local machine is weak)

βœ… ALWAYS DO THESE

βœ… Environment Setup

# ALWAYS activate virtual environment first
cd /Users/jeanbapt/Dragon-fin && source venv/bin/activate

# ALWAYS load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()

βœ… Authentication

# ALWAYS use correct tokens for their purposes
hf_token_lc = os.getenv('HF_TOKEN_LC')  # For LinguaCustodia models
hf_token = os.getenv('HF_TOKEN')        # For HuggingFace Pro features

# ALWAYS authenticate before accessing models
from huggingface_hub import login
login(token=hf_token_lc)  # For model access

βœ… Model Configuration

# ALWAYS use these exact parameters for LinguaCustodia models
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token_lc,
    torch_dtype=torch.bfloat16,  # CONFIRMED: All models use bf16
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True
)

πŸ“Š Current Production Configuration

βœ… API Endpoints

  • Standard: /, /health, /inference, /docs, /load-model, /models, /backend
  • OpenAI-compatible: /v1/chat/completions, /v1/completions, /v1/models
  • Analytics: /analytics/performance, /analytics/costs, /analytics/usage

βœ… Model Compatibility

  • L40 GPU Compatible: Llama 3.1 8B, Qwen 3 8B, Fin-Pythia 1.4B
  • L40 GPU Incompatible: Gemma 3 12B, Llama 3.1 70B (too large)

βœ… Storage Strategy

  • Persistent Storage: /data/.huggingface (150GB)
  • Automatic Fallback: ~/.cache/huggingface if persistent unavailable
  • Cache Preservation: Disk cache never cleared (only GPU memory)

πŸ”§ Model Loading Rules

βœ… Three-Tier Caching Strategy

  1. First Load: Downloads and caches to persistent storage
  2. Same Model: Reuses loaded model in memory (instant)
  3. Model Switch: Clears GPU memory, loads from disk cache
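In sketch form (the dict-based registry is illustrative; the real service holds the model inside the vLLM engine):

```python
_loaded = {"name": None, "model": None}

def get_model(name, load_fn):
    """Three-tier lookup: reuse the in-memory model if the name matches,
    otherwise load via load_fn (which hits the disk cache or downloads)."""
    if _loaded["name"] == name:
        return _loaded["model"]        # Tier 2: same model, instant reuse
    # Tier 3: on a switch, GPU memory would be cleared here before loading.
    _loaded["model"] = load_fn(name)   # Tier 1/3: disk cache or first download
    _loaded["name"] = name
    return _loaded["model"]
```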

βœ… Memory Management

import gc
import torch

def cleanup_model_memory(pipe=None, model=None, tokenizer=None):
    # Drop Python references so the objects become collectable
    del pipe, model, tokenizer

    # Clear GPU cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # Force garbage collection
    gc.collect()

    # Disk cache PRESERVED for fast reloading

βœ… Model Switching Process

  1. Clear GPU Memory: Remove current model from GPU
  2. Service Restart: Required for vLLM model switching
  3. Load New Model: From disk cache or download
  4. Initialize vLLM Engine: With new model configuration

🎯 L40 GPU Limitations

βœ… Compatible Models (Recommended)

  • Llama 3.1 8B: ~24GB total memory usage
  • Qwen 3 8B: ~24GB total memory usage
  • Fin-Pythia 1.4B: ~6GB total memory usage

❌ Incompatible Models

  • Gemma 3 12B: ~45GB needed (exceeds 48GB L40 capacity)
  • Llama 3.1 70B: ~80GB needed (exceeds 48GB L40 capacity)

πŸ” Memory Analysis

8B Models (Working):
Model weights:        ~16GB βœ…
KV caches:           ~8GB  βœ…
Inference buffers:   ~4GB  βœ…
System overhead:     ~2GB  βœ…
Total used:          ~30GB (fits comfortably)

12B+ Models (Failing):
Model weights:        ~22GB βœ… (loads successfully)
KV caches:           ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:        ~48GB (consumes the L40's full capacity; allocation fails)
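The arithmetic above can be folded into a rough estimator. The component ratios and the headroom threshold are assumptions interpolated from the two breakdowns, not measurements:

```python
def estimate_vllm_memory_gb(params_billion):
    """Back-of-envelope bf16 memory estimate in GB, mirroring the breakdowns above."""
    weights = params_billion * 2.0   # bf16: 2 bytes per parameter
    kv_cache = weights * 0.5         # rough ratio taken from the 8B figures
    buffers = weights * 0.25
    overhead = 2.5
    total = weights + kv_cache + buffers + overhead
    # Require headroom below the L40's 48GB nominal capacity (threshold assumed).
    return {"total_gb": round(total, 1), "fits_l40": total < 44.0}
```

For 8B parameters this gives ~30.5GB (fits), and for 12B ~44.5GB (rejected), matching the tables above in spirit if not to the gigabyte.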

πŸš€ Deployment Rules

βœ… HuggingFace Spaces

  • Use Docker SDK: With proper user setup (ID 1000)
  • Set hardware: L40 GPU for optimal performance
  • Use port 7860: HuggingFace standard
  • Include --chown=user: For file permissions in Dockerfile
  • Set HF_HOME=/data/.huggingface: For persistent storage
  • Use 150GB+ persistent storage: For model caching

βœ… Environment Variables

# Required in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface

βœ… Docker Configuration

# Use python -m uvicorn instead of uvicorn directly
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

# Include --chown=user for file permissions
COPY --chown=user:user . /app

πŸ§ͺ Testing Rules

βœ… Always Test in This Order

# 1. Test health endpoint
curl https://your-api-url.hf.space/health

# 2. Test model switching
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# 3. Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'

βœ… Cloud Development Only

# ALWAYS use cloud platforms for testing and development
# Local machine is weak - no local implementation possible

# Test on HuggingFace Spaces or Scaleway instead
# Deploy to cloud platforms for all testing and development

πŸ“ File Organization Rules

βœ… Required Files (Keep These)

  • app.py - Main production API (v24.1.0 hybrid architecture)
  • lingua_fin/ - Clean Pydantic package structure (local development)
  • utils/ - Utility scripts and tests
  • .env - Contains HF_TOKEN_LC and HF_TOKEN
  • requirements.txt - Production dependencies
  • Dockerfile - Container configuration

βœ… Documentation Files

  • README.md - Main project documentation
  • docs/COMPREHENSIVE_DOCUMENTATION.md - Complete unified documentation
  • docs/PROJECT_RULES.md - This file (MANDATORY REFERENCE)
  • docs/L40_GPU_LIMITATIONS.md - GPU compatibility guide

🚨 Emergency Troubleshooting

If Model Loading Fails:

  1. Check if .env file has HF_TOKEN_LC
  2. Verify virtual environment is activated
  3. Check if model is compatible with L40 GPU
  4. Verify GPU memory availability
  5. Try smaller model first
  6. Remember: No local testing - use cloud platforms only
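The checklist translates into a preflight helper. The compatible-model slugs other than `qwen3-8b` are hypothetical placeholders, since that is the only MODEL_NAME value shown in this document:

```python
import os

L40_COMPATIBLE = ("qwen3-8b", "llama3.1-8b", "fin-pythia-1.4b")  # slugs assumed

def preflight_checks(model_name, environ=None):
    """Return human-readable failures for the model-loading checklist above."""
    environ = os.environ if environ is None else environ
    problems = []
    if not environ.get("HF_TOKEN_LC"):
        problems.append(".env is missing HF_TOKEN_LC")
    if model_name not in L40_COMPATIBLE:
        problems.append(model_name + " is not L40-compatible; try a smaller model")
    return problems
```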

If Authentication Fails:

  1. Check HF_TOKEN_LC in .env file
  2. Verify token has access to LinguaCustodia organization
  3. Try re-authenticating with login(token=hf_token_lc)

If Space Deployment Fails:

  1. Check HF Space settings for required secrets
  2. Verify hardware configuration (L40 GPU)
  3. Check Dockerfile for proper user setup
  4. Verify port configuration (7860)

πŸ“ Quick Reference Commands

# Activate environment (ALWAYS FIRST)
source venv/bin/activate

# Test Space health
curl https://your-api-url.hf.space/health

# Switch to Qwen model
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Your question here", "max_new_tokens": 100}'

🎯 REMEMBER: These are the GOLDEN RULES - NEVER CHANGE

  1. βœ… .env contains all keys and secrets
  2. βœ… HF_TOKEN_LC is for pulling models from LinguaCustodia
  3. βœ… HF_TOKEN is for HF repo access and Pro features
  4. βœ… Every model switch needs a full reload because vLLM does not support hot swaps
  5. βœ… We expose OpenAI standard interface
  6. βœ… No local implementation - local machine is weak, use cloud platforms only

This document is the single source of truth for project rules! πŸ“š

