# LinguaCustodia Project Rules & Guidelines

Version: 24.1.0
Last Updated: October 6, 2025
Status: ✅ Production Ready
## 🏆 GOLDEN RULES - NEVER CHANGE

### 1. Environment Variables (MANDATORY)

```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here   # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here     # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                          # Default model selection
DEPLOYMENT_ENV=huggingface                   # Platform configuration
```
Critical Rules:
- ✅ `HF_TOKEN_LC`: for accessing private LinguaCustodia models
- ✅ `HF_TOKEN`: for HuggingFace Pro account features (endpoints, Spaces, etc.)
- ✅ Always load from `.env`: `from dotenv import load_dotenv; load_dotenv()`
### 2. Model Reloading (vLLM Limitation)

```python
# vLLM does not support hot swaps - a service restart is required.
# Solution: service restart mechanism via the /load-model endpoint.
# Process: clear GPU memory → restart service → load new model
```
Critical Rules:
- ✅ vLLM does not support hot swaps
- ✅ We must restart the service to switch models, because vLLM cannot hot-swap
- ✅ Service restart mechanism implemented for model switching
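Because a switch is a full restart, clients should expect `/health` to go dark and come back. A minimal client-side sketch of the switch-and-wait flow (the base URL is a placeholder; the endpoint names follow the rules above):

```python
import time
import urllib.error
import urllib.request

def load_model_url(base_url: str, model_name: str) -> str:
    """Build the /load-model URL that triggers a restart-based switch."""
    return f"{base_url.rstrip('/')}/load-model?model_name={model_name}"

def switch_model(base_url: str, model_name: str, timeout_s: int = 600) -> bool:
    """POST /load-model, then poll /health until the restarted service answers."""
    req = urllib.request.Request(load_model_url(base_url, model_name), method="POST")
    urllib.request.urlopen(req, timeout=30)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url.rstrip('/')}/health", timeout=10) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service is restarting; keep polling
        time.sleep(5)
    return False

# Usage against the live Space:
#   switch_model("https://your-api-url.hf.space", "qwen3-8b")
```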
### 3. OpenAI Standard Interface

```python
# We expose the OpenAI standard interface.
# Endpoints: /v1/chat/completions, /v1/completions, /v1/models
# Full compatibility for easy integration.
```
Critical Rules:
- ✅ We expose the OpenAI standard interface
- ✅ Full OpenAI API compatibility
- ✅ Standard endpoints for easy integration
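Because the interface is OpenAI-standard, the official `openai` Python client can talk to the Space just by overriding `base_url`. A sketch of the request shape (the URL is a placeholder):

```python
def chat_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Request body shape accepted by /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# With the official client (pip install openai):
#   from openai import OpenAI
#   client = OpenAI(base_url="https://your-api-url.hf.space/v1", api_key="unused")
#   resp = client.chat.completions.create(**chat_payload("qwen3-8b", "What is SFCR?"))
#   print(resp.choices[0].message.content)
```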
## 🚫 NEVER DO THESE

### ❌ Token Usage Mistakes
- NEVER use `HF_TOKEN` for LinguaCustodia model access (use `HF_TOKEN_LC`)
- NEVER use `HF_TOKEN_LC` for HuggingFace Pro features (use `HF_TOKEN`)
- NEVER hardcode tokens in code (always use environment variables)
### ❌ Model Loading Mistakes
- NEVER try to hot-swap models with vLLM (service restart required)
- NEVER use 12B+ models on L40 GPU (memory allocation fails)
- NEVER skip GPU memory cleanup during model switching
### ❌ Deployment Mistakes
- NEVER skip virtual environment activation
- NEVER use global Python installations
- NEVER forget to load environment variables from .env
- NEVER attempt local implementation or testing (local machine is weak)
## ✅ ALWAYS DO THESE

### ✅ Environment Setup

```bash
# ALWAYS activate virtual environment first
cd /Users/jeanbapt/Dragon-fin && source venv/bin/activate
```

```python
# ALWAYS load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
```
### ✅ Authentication

```python
import os

# ALWAYS use correct tokens for their purposes
hf_token_lc = os.getenv('HF_TOKEN_LC')  # For LinguaCustodia models
hf_token = os.getenv('HF_TOKEN')        # For HuggingFace Pro features

# ALWAYS authenticate before accessing models
from huggingface_hub import login
login(token=hf_token_lc)  # For model access
```
### ✅ Model Configuration

```python
import torch
from transformers import AutoModelForCausalLM

# ALWAYS use these exact parameters for LinguaCustodia models
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token_lc,
    torch_dtype=torch.bfloat16,  # CONFIRMED: all models use bf16
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
```
## 📋 Current Production Configuration

### ✅ Space Configuration
- Space URL: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- Hardware: L40 GPU (48GB VRAM, $1.80/hour)
- Backend: vLLM (official v0.2.0+) with eager mode
- Port: 7860 (HuggingFace standard)
- Status: fully operational with vLLM backend abstraction
### ✅ API Endpoints
- Standard: /, /health, /inference, /docs, /load-model, /models, /backend
- OpenAI-compatible: /v1/chat/completions, /v1/completions, /v1/models
- Analytics: /analytics/performance, /analytics/costs, /analytics/usage
### ✅ Model Compatibility
- L40 GPU Compatible: Llama 3.1 8B, Qwen 3 8B, Fin-Pythia 1.4B
- L40 GPU Incompatible: Gemma 3 12B, Llama 3.1 70B (too large)
### ✅ Storage Strategy
- Persistent Storage: `/data/.huggingface` (150GB)
- Automatic Fallback: `~/.cache/huggingface` if persistent storage is unavailable
- Cache Preservation: the disk cache is never cleared (only GPU memory)
## 🔧 Model Loading Rules

### ✅ Three-Tier Caching Strategy
- First Load: downloads and caches to persistent storage
- Same Model: reuses the loaded model in memory (instant)
- Model Switch: clears GPU memory, loads from disk cache
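The three tiers reduce to one decision function; a sketch (names are illustrative, not the actual app.py API):

```python
from typing import Optional, Set

def load_action(requested: str, loaded: Optional[str], disk_cache: Set[str]) -> str:
    """Map a model request onto the three-tier caching strategy."""
    if requested == loaded:
        return "reuse-in-memory"          # same model: instant
    if requested in disk_cache:
        return "restart-load-from-disk"   # switch: GPU cleared, disk cache hit
    return "download-cache-then-load"     # first load: download and cache
```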
### ✅ Memory Management

```python
import gc
import torch

def cleanup_model_memory():
    """Free GPU memory before a model switch; the disk cache is untouched."""
    global pipe, model, tokenizer
    # Delete Python objects holding GPU tensors
    del pipe, model, tokenizer
    # Clear GPU cache
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # Force garbage collection
    gc.collect()
    # Disk cache PRESERVED for fast reloading
```
### ✅ Model Switching Process
1. Clear GPU Memory: remove the current model from the GPU
2. Service Restart: required for vLLM model switching
3. Load New Model: from disk cache, or download if absent
4. Initialize vLLM Engine: with the new model configuration
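Server-side, the restart step can be done by re-exec'ing the process so vLLM boots fresh with the new model. A sketch, assuming `MODEL_NAME` is read on startup as the env rules above describe:

```python
import os
import sys

def restart_argv() -> list:
    """Command used to re-exec the current service process."""
    return [sys.executable] + sys.argv

def restart_with_model(model_name: str) -> None:
    """Record the next model, then replace this process with a fresh one.
    GPU cleanup (cleanup_model_memory) should run before calling this."""
    os.environ["MODEL_NAME"] = model_name  # read again on startup
    os.execv(sys.executable, restart_argv())
```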
## 🎯 L40 GPU Limitations

### ✅ Compatible Models (Recommended)
- Llama 3.1 8B: ~24GB total memory usage
- Qwen 3 8B: ~24GB total memory usage
- Fin-Pythia 1.4B: ~6GB total memory usage
### ❌ Incompatible Models
- Gemma 3 12B: ~45GB needed (exceeds the L40's usable 48GB capacity once runtime overhead is included)
- Llama 3.1 70B: ~80GB needed (exceeds 48GB L40 capacity)
### 📊 Memory Analysis

```
8B Models (Working):
  Model weights:     ~16GB  ✅
  KV caches:          ~8GB  ✅
  Inference buffers:  ~4GB  ✅
  System overhead:    ~2GB  ✅
  Total used:        ~30GB  (fits comfortably)

12B+ Models (Failing):
  Model weights:     ~22GB  ✅ (loads successfully)
  KV caches:         ~15GB  ❌ (allocation fails)
  Inference buffers:  ~8GB  ❌ (allocation fails)
  System overhead:    ~3GB  ❌ (allocation fails)
  Total needed:      ~48GB  (exceeds usable L40 capacity)
```
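The totals above follow from bf16 weights at 2 bytes per parameter plus runtime overheads. A rough go/no-go calculator using the estimates from this section (the numbers are the approximations above, not measurements):

```python
def bf16_weights_gb(params_billions: float) -> float:
    """bf16 stores 2 bytes per parameter, so an 8B model needs ~16GB of weights."""
    return params_billions * 2.0

def fits_l40(params_billions: float, kv_gb: float, buffers_gb: float,
             overhead_gb: float, capacity_gb: float = 48.0) -> bool:
    """Rough check of total footprint against the L40's 48GB VRAM."""
    total = bf16_weights_gb(params_billions) + kv_gb + buffers_gb + overhead_gb
    return total < capacity_gb

# 8B profile from the analysis above: 16 + 8 + 4 + 2 = 30GB -> fits
# 12B profile: ~24 + 15 + 8 + 3 = 50GB -> does not fit
```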
## 🚀 Deployment Rules

### ✅ HuggingFace Spaces
- Use Docker SDK: With proper user setup (ID 1000)
- Set hardware: L40 GPU for optimal performance
- Use port 7860: HuggingFace standard
- Include --chown=user: For file permissions in Dockerfile
- Set HF_HOME=/data/.huggingface: For persistent storage
- Use 150GB+ persistent storage: For model caching
### ✅ Environment Variables

```bash
# Required in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```
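A startup check lets the Space fail fast when a required secret is missing from the settings; a minimal sketch (the helper name is illustrative):

```python
import os

REQUIRED_ENV = ("HF_TOKEN_LC", "HF_TOKEN", "MODEL_NAME", "DEPLOYMENT_ENV", "HF_HOME")

def missing_env(required=REQUIRED_ENV) -> list:
    """Return the required variables that are unset or empty."""
    return [name for name in required if not os.getenv(name)]

# At startup:
#   missing = missing_env()
#   if missing:
#       raise RuntimeError(f"Missing required environment variables: {missing}")
```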
### ✅ Docker Configuration

```dockerfile
# Include --chown=user for file permissions
COPY --chown=user:user . /app

# Use python -m uvicorn instead of calling uvicorn directly
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```
## 🧪 Testing Rules

### ✅ Always Test in This Order

```bash
# 1. Test health endpoint
curl https://your-api-url.hf.space/health

# 2. Test model switching
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# 3. Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```
### ✅ Cloud Development Only

```bash
# ALWAYS use cloud platforms for testing and development.
# The local machine is too weak for local implementation or testing.
# Test and deploy on HuggingFace Spaces or Scaleway instead.
```
## 📁 File Organization Rules

### ✅ Required Files (Keep These)
- `app.py` - main production API (v24.1.0 hybrid architecture)
- `lingua_fin/` - clean Pydantic package structure (local development)
- `utils/` - utility scripts and tests
- `.env` - contains HF_TOKEN_LC and HF_TOKEN
- `requirements.txt` - production dependencies
- `Dockerfile` - container configuration

### ✅ Documentation Files
- `README.md` - main project documentation
- `docs/COMPREHENSIVE_DOCUMENTATION.md` - complete unified documentation
- `docs/PROJECT_RULES.md` - this file (MANDATORY REFERENCE)
- `docs/L40_GPU_LIMITATIONS.md` - GPU compatibility guide
## 🚨 Emergency Troubleshooting

### If Model Loading Fails:
- Check that the `.env` file has `HF_TOKEN_LC`
- Verify the virtual environment is activated
- Check that the model is compatible with the L40 GPU
- Verify GPU memory availability
- Try a smaller model first
- Remember: no local testing - use cloud platforms only

### If Authentication Fails:
- Check `HF_TOKEN_LC` in the `.env` file
- Verify the token has access to the LinguaCustodia organization
- Try re-authenticating with `login(token=hf_token_lc)`

### If Space Deployment Fails:
- Check HF Space settings for the required secrets
- Verify the hardware configuration (L40 GPU)
- Check the Dockerfile for proper user setup
- Verify the port configuration (7860)
## 📋 Quick Reference Commands

```bash
# Activate environment (ALWAYS FIRST)
source venv/bin/activate

# Test Space health
curl https://your-api-url.hf.space/health

# Switch to Qwen model
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Your question here", "max_new_tokens": 100}'
```
## 🎯 REMEMBER: These are the GOLDEN RULES - NEVER CHANGE
- ✅ `.env` contains all keys and secrets
- ✅ `HF_TOKEN_LC` is for pulling models from LinguaCustodia
- ✅ `HF_TOKEN` is for HF repo access and Pro features
- ✅ We need to reload because vLLM does not support hot swaps
- ✅ We expose the OpenAI standard interface
- ✅ No local implementation - the local machine is weak; use cloud platforms only
This document is the single source of truth for project rules! 🚀