# LinguaCustodia Project Rules & Guidelines
**Version**: 24.1.0
**Last Updated**: October 6, 2025
**Status**: ✅ Production Ready
---
## 🏆 **GOLDEN RULES - NEVER CHANGE**
### **1. Environment Variables (MANDATORY)**
```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here # For HF repo access and Pro features
MODEL_NAME=qwen3-8b # Default model selection
DEPLOYMENT_ENV=huggingface # Platform configuration
```
**Critical Rules:**
- ✅ **HF_TOKEN_LC**: For accessing private LinguaCustodia models
- ✅ **HF_TOKEN**: For HuggingFace Pro account features (endpoints, Spaces, etc.)
- ✅ **Always load from .env**: `from dotenv import load_dotenv; load_dotenv()` (see the sketch below)
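A minimal loading sketch consistent with these rules (the `require_env` helper is illustrative, not part of the codebase):
```python
import os

from dotenv import load_dotenv

# ALWAYS load .env before reading any secrets
load_dotenv()

def require_env(name: str) -> str:
    """Illustrative helper: fail fast when a required secret is missing."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set - check your .env file")
    return value

hf_token_lc = require_env("HF_TOKEN_LC")  # pulling LinguaCustodia models
hf_token = require_env("HF_TOKEN")        # HF repo access and Pro features
```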
### **2. Model Reloading (vLLM Limitation)**
```python
# vLLM does not support hot swaps - service restart required
# Solution: Implemented service restart mechanism via /load-model endpoint
# Process: Clear GPU memory → Restart service → Load new model
```
**Critical Rules:**
- ❌ **vLLM does not support hot swaps**
- ✅ **We need to reload because vLLM does not support hot swaps**
- ✅ **Service restart mechanism implemented for model switching** (see the sketch below)
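A minimal sketch of that restart mechanism, assuming the platform supervisor relaunches the process after it exits (the handler body and persistence strategy are illustrative, not the production code):
```python
import os
import signal
import threading

from fastapi import FastAPI

app = FastAPI()

@app.post("/load-model")
def load_model(model_name: str):
    # Record the requested model for the restarted process to pick up.
    # (A real implementation must persist this outside the process,
    # since environment changes die with it.)
    os.environ["MODEL_NAME"] = model_name
    # vLLM cannot hot-swap weights: respond first, then terminate so the
    # supervisor restarts the service. GPU memory is freed when the process
    # dies; the disk cache under HF_HOME is untouched.
    threading.Timer(1.0, lambda: os.kill(os.getpid(), signal.SIGTERM)).start()
    return {"status": "restarting", "model": model_name}
```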
### **3. OpenAI Standard Interface**
```python
# We expose OpenAI standard interface
# Endpoints: /v1/chat/completions, /v1/completions, /v1/models
# Full compatibility for easy integration
```
**Critical Rules:**
- ✅ **We expose OpenAI standard interface**
- ✅ **Full OpenAI API compatibility**
- ✅ **Standard endpoints for easy integration** (see the client example below)
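Because the interface follows the OpenAI spec, the official `openai` Python client can target the Space directly (the base URL is the placeholder used throughout this document; the dummy API key assumes the Space does not enforce client auth):
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-api-url.hf.space/v1",
    api_key="unused",  # assumption: the Space does not validate this value
)

response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "What is SFCR?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```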
---
## 🚫 **NEVER DO THESE**
### **❌ Token Usage Mistakes**
1. **NEVER** use `HF_TOKEN` for LinguaCustodia model access (use `HF_TOKEN_LC`)
2. **NEVER** use `HF_TOKEN_LC` for HuggingFace Pro features (use `HF_TOKEN`)
3. **NEVER** hardcode tokens in code (always use environment variables)
### **❌ Model Loading Mistakes**
1. **NEVER** try to hot-swap models with vLLM (service restart required)
2. **NEVER** use 12B+ models on L40 GPU (memory allocation fails)
3. **NEVER** skip GPU memory cleanup during model switching
### **❌ Deployment Mistakes**
1. **NEVER** skip virtual environment activation
2. **NEVER** use global Python installations
3. **NEVER** forget to load environment variables from .env
4. **NEVER** attempt local implementation or testing (local machine is weak)
---
## ✅ **ALWAYS DO THESE**
### **✅ Environment Setup**
```bash
# ALWAYS activate the virtual environment first
cd /Users/jeanbapt/Dragon-fin && source venv/bin/activate

# ALWAYS load environment variables from .env in application code (Python):
#   from dotenv import load_dotenv; load_dotenv()
```
### **✅ Authentication**
```python
import os

from huggingface_hub import login

# ALWAYS use the correct token for its purpose
hf_token_lc = os.getenv('HF_TOKEN_LC')  # For LinguaCustodia models
hf_token = os.getenv('HF_TOKEN')        # For HuggingFace Pro features

# ALWAYS authenticate before accessing models
login(token=hf_token_lc)  # For model access
```
### **✅ Model Configuration**
```python
import torch
from transformers import AutoModelForCausalLM

# ALWAYS use these exact parameters for LinguaCustodia models
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token_lc,
    torch_dtype=torch.bfloat16,  # CONFIRMED: All models use bf16
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
```
---
## 📊 **Current Production Configuration**
### **✅ Space Configuration**
- **Space URL**: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- **Hardware**: L40 GPU (48GB VRAM, $1.80/hour)
- **Backend**: vLLM (official v0.2.0+) with eager mode
- **Port**: 7860 (HuggingFace standard)
- **Status**: Fully operational with vLLM backend abstraction
### **✅ API Endpoints**
- **Standard**: /, /health, /inference, /docs, /load-model, /models, /backend
- **OpenAI-compatible**: /v1/chat/completions, /v1/completions, /v1/models
- **Analytics**: /analytics/performance, /analytics/costs, /analytics/usage
### **✅ Model Compatibility**
- **L40 GPU Compatible**: Llama 3.1 8B, Qwen 3 8B, Fin-Pythia 1.4B
- **L40 GPU Incompatible**: Gemma 3 12B, Llama 3.1 70B (too large)
### **✅ Storage Strategy**
- **Persistent Storage**: `/data/.huggingface` (150GB)
- **Automatic Fallback**: `~/.cache/huggingface` if persistent storage is unavailable (see the sketch below)
- **Cache Preservation**: Disk cache never cleared (only GPU memory)
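A minimal sketch of the fallback, assuming availability is detected by trying to create the persistent directory (the helper name is illustrative):
```python
import os
from pathlib import Path

def resolve_hf_home() -> str:
    persistent = Path("/data/.huggingface")
    try:
        # Persistent 150GB volume mounted at /data
        persistent.mkdir(parents=True, exist_ok=True)
        return str(persistent)
    except OSError:
        # /data is unavailable: fall back to the default user cache
        return str(Path.home() / ".cache" / "huggingface")

os.environ.setdefault("HF_HOME", resolve_hf_home())
```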
---
## 🔧 **Model Loading Rules**
### **✅ Three-Tier Caching Strategy**
1. **First Load**: Downloads and caches to persistent storage
2. **Same Model**: Reuses loaded model in memory (instant)
3. **Model Switch**: Clears GPU memory, loads from disk cache (sketched below)
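The three tiers as decision logic, in a sketch (the state class and the injected `load_fn`/`cleanup_fn` callables are illustrative, not the production API):
```python
from typing import Any, Callable, Optional

class ModelState:
    """Tracks which model currently occupies the GPU (illustrative)."""
    def __init__(self) -> None:
        self.model_name: Optional[str] = None
        self.engine: Optional[Any] = None

def ensure_model(
    requested: str,
    state: ModelState,
    load_fn: Callable[[str], Any],
    cleanup_fn: Callable[[Any], None],
) -> Any:
    # Tier 2: same model already in memory -> reuse it instantly
    if state.model_name == requested and state.engine is not None:
        return state.engine
    # Tier 3: a different model holds the GPU -> free VRAM, keep disk cache
    if state.engine is not None:
        cleanup_fn(state.engine)
        state.engine = None
    # Tier 1: load from HF_HOME; a download happens only on a cache miss
    state.engine = load_fn(requested)
    state.model_name = requested
    return state.engine
```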
### **✅ Memory Management**
```python
import gc

import torch

def cleanup_model_memory():
    global pipe, model, tokenizer
    # Delete the Python objects holding the GPU tensors
    del pipe, model, tokenizer
    # Clear the CUDA allocator cache and wait for in-flight work
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # Force garbage collection
    gc.collect()
    # Disk cache PRESERVED for fast reloading
```
### **✅ Model Switching Process**
1. **Clear GPU Memory**: Remove current model from GPU
2. **Service Restart**: Required for vLLM model switching
3. **Load New Model**: From disk cache or download
4. **Initialize vLLM Engine**: With new model configuration (see the client-side sketch below)
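From the client's side, the same process is "request the switch, then poll until the restarted service is healthy again" (a sketch; the endpoints come from this document, the timing values are assumptions):
```python
import time

import requests

BASE = "https://your-api-url.hf.space"

def switch_model(model_name: str, timeout_s: int = 600) -> None:
    # Steps 1-2: ask the service to clear GPU memory and restart itself
    requests.post(f"{BASE}/load-model", params={"model_name": model_name})
    # Steps 3-4: poll until the restarted service reports healthy again
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE}/health", timeout=10).ok:
                return
        except requests.RequestException:
            pass  # service is mid-restart; keep polling
        time.sleep(5)
    raise TimeoutError(f"{model_name} did not come up within {timeout_s}s")

switch_model("qwen3-8b")
```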
---
## 🎯 **L40 GPU Limitations**
### **✅ Compatible Models (Recommended)**
- **Llama 3.1 8B**: ~24GB total memory usage
- **Qwen 3 8B**: ~24GB total memory usage
- **Fin-Pythia 1.4B**: ~6GB total memory usage
### **❌ Incompatible Models**
- **Gemma 3 12B**: ~45GB needed (exceeds 48GB L40 capacity)
- **Llama 3.1 70B**: ~80GB needed (exceeds 48GB L40 capacity)
### **📊 Memory Analysis**
```
8B Models (Working):
Model weights: ~16GB ✅
KV caches: ~8GB ✅
Inference buffers: ~4GB ✅
System overhead: ~2GB ✅
Total used: ~30GB (fits comfortably)

12B+ Models (Failing):
Model weights: ~22GB ✅ (loads successfully)
KV caches: ~15GB ❌ (allocation fails)
Inference buffers: ~8GB ❌ (allocation fails)
System overhead: ~3GB ❌ (allocation fails)
Total needed: ~48GB (exceeds L40 capacity)
```
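A rough rule of thumb behind these figures (an approximation for sanity checks, not a measurement): bf16 weights cost about 2 bytes per parameter, and serving adds most of that again in KV caches and buffers.
```python
def rough_vram_estimate_gb(params_billions: float) -> float:
    weights = params_billions * 2.0  # bf16: ~2 bytes per parameter
    kv_and_buffers = weights * 0.75  # crude serving-overhead multiplier
    system = 2.5                     # CUDA context, fragmentation, etc.
    return weights + kv_and_buffers + system

print(rough_vram_estimate_gb(8))   # ~30.5 GB -> fits a 48GB L40
print(rough_vram_estimate_gb(12))  # ~44.5 GB -> no headroom; allocation fails
```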
---
## 🚀 **Deployment Rules**
### **✅ HuggingFace Spaces**
- **Use Docker SDK**: With proper user setup (ID 1000)
- **Set hardware**: L40 GPU for optimal performance
- **Use port 7860**: HuggingFace standard
- **Include --chown=user**: For file permissions in Dockerfile
- **Set HF_HOME=/data/.huggingface**: For persistent storage
- **Use 150GB+ persistent storage**: For model caching
### **✅ Environment Variables**
```bash
# Required in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```
### **✅ Docker Configuration**
```dockerfile
# Include --chown=user for file permissions
COPY --chown=user:user . /app

# Use python -m uvicorn instead of uvicorn directly
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```
---
## 🧪 **Testing Rules**
### **✅ Always Test in This Order**
```bash
# 1. Test health endpoint
curl https://your-api-url.hf.space/health
# 2. Test model switching
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"
# 3. Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
-H "Content-Type: application/json" \
-d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```
### **✅ Cloud Development Only**
```bash
# ALWAYS use cloud platforms for testing and development
# The local machine is too weak for local implementation or testing
# Deploy to HuggingFace Spaces or Scaleway for all testing and development
```
---
## 📁 **File Organization Rules**
### **✅ Required Files (Keep These)**
- `app.py` - Main production API (v24.1.0 hybrid architecture)
- `lingua_fin/` - Clean Pydantic package structure (local development)
- `utils/` - Utility scripts and tests
- `.env` - Contains HF_TOKEN_LC and HF_TOKEN
- `requirements.txt` - Production dependencies
- `Dockerfile` - Container configuration
### **✅ Documentation Files**
- `README.md` - Main project documentation
- `docs/COMPREHENSIVE_DOCUMENTATION.md` - Complete unified documentation
- `docs/PROJECT_RULES.md` - This file (MANDATORY REFERENCE)
- `docs/L40_GPU_LIMITATIONS.md` - GPU compatibility guide
---
## 🚨 **Emergency Troubleshooting**
### **If Model Loading Fails:**
1. Check if `.env` file has `HF_TOKEN_LC`
2. Verify virtual environment is activated
3. Check if model is compatible with L40 GPU
4. Verify GPU memory availability
5. Try smaller model first
6. **Remember: No local testing - use cloud platforms only**
### **If Authentication Fails:**
1. Check `HF_TOKEN_LC` in `.env` file
2. Verify token has access to LinguaCustodia organization
3. Try re-authenticating with `login(token=hf_token_lc)`
### **If Space Deployment Fails:**
1. Check HF Space settings for required secrets
2. Verify hardware configuration (L40 GPU)
3. Check Dockerfile for proper user setup
4. Verify port configuration (7860)
---
## 📋 **Quick Reference Commands**
```bash
# Activate environment (ALWAYS FIRST)
source venv/bin/activate
# Test Space health
curl https://your-api-url.hf.space/health
# Switch to Qwen model
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"
# Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
-H "Content-Type: application/json" \
-d '{"prompt": "Your question here", "max_new_tokens": 100}'
```
---
## 🎯 **REMEMBER: These are the GOLDEN RULES - NEVER CHANGE**
1. ✅ **.env contains all keys and secrets**
2. ✅ **HF_TOKEN_LC is for pulling models from LinguaCustodia**
3. ✅ **HF_TOKEN is for HF repo access and Pro features**
4. ✅ **We need to reload because vLLM does not support hot swaps**
5. ✅ **We expose OpenAI standard interface**
6. ✅ **No local implementation - local machine is weak, use cloud platforms only**
**This document is the single source of truth for project rules!** 📌
---
**Last Updated**: October 6, 2025
**Version**: 24.1.0
**Status**: Production Ready ✅