# LinguaCustodia Project Rules & Guidelines

**Version**: 24.1.0  
**Last Updated**: October 6, 2025  
**Status**: βœ… Production Ready

---

## πŸ”‘ **GOLDEN RULES - NEVER CHANGE**

### **1. Environment Variables (MANDATORY)**
```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here    # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here      # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                           # Default model selection
DEPLOYMENT_ENV=huggingface                    # Platform configuration
```

**Critical Rules:**
- βœ… **HF_TOKEN_LC**: For accessing private LinguaCustodia models
- βœ… **HF_TOKEN**: For HuggingFace Pro account features (endpoints, Spaces, etc.)
- βœ… **Always load from .env**: `from dotenv import load_dotenv; load_dotenv()`

### **2. Model Reloading (vLLM Limitation)**
```python
# vLLM does not support hot swaps - service restart required
# Solution: Implemented service restart mechanism via /load-model endpoint
# Process: Clear GPU memory β†’ Restart service β†’ Load new model
```

**Critical Rules:**
- ❌ **vLLM does not support hot swaps**
- βœ… **We need to reload because vLLM does not support hot swaps**
- βœ… **Service restart mechanism implemented for model switching**

### **3. OpenAI Standard Interface**
```python
# We expose OpenAI standard interface
# Endpoints: /v1/chat/completions, /v1/completions, /v1/models
# Full compatibility for easy integration
```

**Critical Rules:**
- βœ… **We expose OpenAI standard interface**
- βœ… **Full OpenAI API compatibility**
- βœ… **Standard endpoints for easy integration**
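
Calling the interface can be sketched with the standard library alone. The base URL is a placeholder, and the `build_chat_request`/`chat` helper names are illustrative, not part of the actual API code:

```python
# Minimal sketch of a /v1/chat/completions call using only the stdlib.
# API_BASE is a placeholder; the model alias follows the /load-model
# naming used elsewhere in this document (assumption).
import json
import urllib.request

API_BASE = "https://your-api-url.hf.space/v1"  # placeholder URL

def build_chat_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build a standard OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> dict:
    """POST the payload to the Space (network call; requires a live deployment)."""
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the payload is the standard OpenAI shape, any OpenAI-compatible client can point at this base URL instead.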

---

## 🚫 **NEVER DO THESE**

### **❌ Token Usage Mistakes**
1. **NEVER** use `HF_TOKEN` for LinguaCustodia model access (use `HF_TOKEN_LC`)
2. **NEVER** use `HF_TOKEN_LC` for HuggingFace Pro features (use `HF_TOKEN`)
3. **NEVER** hardcode tokens in code (always use environment variables)

### **❌ Model Loading Mistakes**
1. **NEVER** try to hot-swap models with vLLM (service restart required)
2. **NEVER** use 12B+ models on L40 GPU (memory allocation fails)
3. **NEVER** skip GPU memory cleanup during model switching

### **❌ Deployment Mistakes**
1. **NEVER** skip virtual environment activation
2. **NEVER** use global Python installations
3. **NEVER** forget to load environment variables from .env
4. **NEVER** attempt local implementation or testing (local machine is weak)

---

## βœ… **ALWAYS DO THESE**

### **βœ… Environment Setup**
```bash
# ALWAYS activate virtual environment first
cd /Users/jeanbapt/Dragon-fin && source venv/bin/activate
```

```python
# ALWAYS load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
```

### **βœ… Authentication**
```python
# ALWAYS use correct tokens for their purposes
hf_token_lc = os.getenv('HF_TOKEN_LC')  # For LinguaCustodia models
hf_token = os.getenv('HF_TOKEN')        # For HuggingFace Pro features

# ALWAYS authenticate before accessing models
from huggingface_hub import login
login(token=hf_token_lc)  # For model access
```

### **βœ… Model Configuration**
```python
# ALWAYS use these exact parameters for LinguaCustodia models
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token_lc,
    torch_dtype=torch.bfloat16,  # CONFIRMED: All models use bf16
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True
)
```

---

## πŸ“Š **Current Production Configuration**

### **βœ… Space Configuration**
- **Space URL**: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- **Hardware**: L40 GPU (48GB VRAM, $1.80/hour)
- **Backend**: vLLM (official v0.2.0+) with eager mode
- **Port**: 7860 (HuggingFace standard)
- **Status**: Fully operational with vLLM backend abstraction

### **βœ… API Endpoints**
- **Standard**: /, /health, /inference, /docs, /load-model, /models, /backend
- **OpenAI-compatible**: /v1/chat/completions, /v1/completions, /v1/models
- **Analytics**: /analytics/performance, /analytics/costs, /analytics/usage

### **βœ… Model Compatibility**
- **L40 GPU Compatible**: Llama 3.1 8B, Qwen 3 8B, Fin-Pythia 1.4B
- **L40 GPU Incompatible**: Gemma 3 12B, Llama 3.1 70B (too large)

### **βœ… Storage Strategy**
- **Persistent Storage**: `/data/.huggingface` (150GB)
- **Automatic Fallback**: `~/.cache/huggingface` if persistent unavailable
- **Cache Preservation**: Disk cache never cleared (only GPU memory)
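
The fallback can be sketched as a small startup check. The `resolve_hf_home` helper name is hypothetical; the paths match the strategy above:

```python
# Sketch of the persistent-storage fallback: prefer /data/.huggingface,
# fall back to the default user cache when /data is not mounted/writable.
import os

def resolve_hf_home(persistent: str = "/data/.huggingface") -> str:
    """Return the persistent path if usable, else the default HF cache."""
    parent = os.path.dirname(persistent)
    if os.path.isdir(parent) and os.access(parent, os.W_OK):
        return persistent
    return os.path.expanduser("~/.cache/huggingface")

# setdefault keeps an explicit HF_HOME from the Space settings intact
os.environ.setdefault("HF_HOME", resolve_hf_home())
```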

---

## πŸ”§ **Model Loading Rules**

### **βœ… Three-Tier Caching Strategy**
1. **First Load**: Downloads and caches to persistent storage
2. **Same Model**: Reuses loaded model in memory (instant)
3. **Model Switch**: Clears GPU memory, loads from disk cache

### **βœ… Memory Management**
```python
import gc
import torch

def cleanup_model_memory():
    """Free GPU memory before a model switch; the disk cache is preserved."""
    global pipe, model, tokenizer

    # Delete Python references to the model objects
    del pipe, model, tokenizer

    # Clear GPU cache
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

    # Force garbage collection
    gc.collect()

    # Disk cache PRESERVED for fast reloading
```

### **βœ… Model Switching Process**
1. **Clear GPU Memory**: Remove current model from GPU
2. **Service Restart**: Required for vLLM model switching
3. **Load New Model**: From disk cache or download
4. **Initialize vLLM Engine**: With new model configuration

---

## 🎯 **L40 GPU Limitations**

### **βœ… Compatible Models (Recommended)**
- **Llama 3.1 8B**: ~24GB total memory usage
- **Qwen 3 8B**: ~24GB total memory usage  
- **Fin-Pythia 1.4B**: ~6GB total memory usage

### **❌ Incompatible Models**
- **Gemma 3 12B**: ~45GB needed (exceeds 48GB L40 capacity)
- **Llama 3.1 70B**: ~80GB needed (exceeds 48GB L40 capacity)

### **πŸ” Memory Analysis**
```
8B Models (Working):
Model weights:        ~16GB βœ…
KV caches:           ~8GB  βœ…
Inference buffers:   ~4GB  βœ…
System overhead:     ~2GB  βœ…
Total used:          ~30GB (fits comfortably)

12B+ Models (Failing):
Model weights:        ~22GB βœ… (loads successfully)
KV caches:           ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:        ~48GB (exceeds L40 capacity)
```
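
A back-of-envelope check of the figures above: bf16 stores 2 bytes per parameter, so weights alone cost roughly 2 GB per billion parameters. The helper names are illustrative, and the KV/buffer/overhead defaults are the estimates from this section, not measured values:

```python
# Rough L40 fit check using the memory breakdown above.
def bf16_weight_gb(params_billion: float) -> float:
    """Approximate weight footprint in decimal GB for a bf16 model."""
    return params_billion * 2.0  # 1e9 params * 2 bytes = 2 GB

def fits_l40(params_billion: float, kv_gb: float,
             buffers_gb: float = 4.0, overhead_gb: float = 2.0,
             vram_gb: float = 48.0) -> bool:
    """True if the estimated total fits within L40 VRAM."""
    total = bf16_weight_gb(params_billion) + kv_gb + buffers_gb + overhead_gb
    return total <= vram_gb
```

With these estimates, an 8B model totals ~30 GB and fits, while a 12B model with larger KV caches and buffers does not.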

---

## πŸš€ **Deployment Rules**

### **βœ… HuggingFace Spaces**
- **Use Docker SDK**: With proper user setup (ID 1000)
- **Set hardware**: L40 GPU for optimal performance
- **Use port 7860**: HuggingFace standard
- **Include --chown=user**: For file permissions in Dockerfile
- **Set HF_HOME=/data/.huggingface**: For persistent storage
- **Use 150GB+ persistent storage**: For model caching

### **βœ… Environment Variables**
```bash
# Required in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```

### **βœ… Docker Configuration**
```dockerfile
# Use python -m uvicorn instead of uvicorn directly
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

# Include --chown=user for file permissions
COPY --chown=user:user . /app
```

---

## πŸ§ͺ **Testing Rules**

### **βœ… Always Test in This Order**
```bash
# 1. Test health endpoint
curl https://your-api-url.hf.space/health

# 2. Test model switching
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# 3. Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```

### **βœ… Cloud Development Only**
```bash
# ALWAYS use cloud platforms for testing and development
# Local machine is weak - no local implementation possible

# Test on HuggingFace Spaces or Scaleway instead
# Deploy to cloud platforms for all testing and development
```

---

## πŸ“ **File Organization Rules**

### **βœ… Required Files (Keep These)**
- `app.py` - Main production API (v24.1.0 hybrid architecture)
- `lingua_fin/` - Clean Pydantic package structure (local development)
- `utils/` - Utility scripts and tests
- `.env` - Contains HF_TOKEN_LC and HF_TOKEN
- `requirements.txt` - Production dependencies
- `Dockerfile` - Container configuration

### **βœ… Documentation Files**
- `README.md` - Main project documentation
- `docs/COMPREHENSIVE_DOCUMENTATION.md` - Complete unified documentation
- `docs/PROJECT_RULES.md` - This file (MANDATORY REFERENCE)
- `docs/L40_GPU_LIMITATIONS.md` - GPU compatibility guide

---

## 🚨 **Emergency Troubleshooting**

### **If Model Loading Fails:**
1. Check if `.env` file has `HF_TOKEN_LC`
2. Verify virtual environment is activated
3. Check if model is compatible with L40 GPU
4. Verify GPU memory availability
5. Try smaller model first
6. **Remember: No local testing - use cloud platforms only**

### **If Authentication Fails:**
1. Check `HF_TOKEN_LC` in `.env` file
2. Verify token has access to LinguaCustodia organization
3. Try re-authenticating with `login(token=hf_token_lc)`

### **If Space Deployment Fails:**
1. Check HF Space settings for required secrets
2. Verify hardware configuration (L40 GPU)
3. Check Dockerfile for proper user setup
4. Verify port configuration (7860)

---

## πŸ“ **Quick Reference Commands**

```bash
# Activate environment (ALWAYS FIRST)
source venv/bin/activate

# Test Space health
curl https://your-api-url.hf.space/health

# Switch to Qwen model
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Your question here", "max_new_tokens": 100}'
```

---

## 🎯 **REMEMBER: These are the GOLDEN RULES - NEVER CHANGE**

1. βœ… **.env contains all keys and secrets**
2. βœ… **HF_TOKEN_LC is for pulling models from LinguaCustodia**
3. βœ… **HF_TOKEN is for HF repo access and Pro features**
4. βœ… **We need to reload because vLLM does not support hot swaps**
5. βœ… **We expose OpenAI standard interface**
6. βœ… **No local implementation - local machine is weak, use cloud platforms only**

**This document is the single source of truth for project rules!** πŸ“š
