# LinguaCustodia Project Rules & Guidelines

**Version**: 24.1.0  
**Last Updated**: October 6, 2025  
**Status**: βœ… Production Ready

---

## πŸ”‘ **GOLDEN RULES - NEVER CHANGE**

### **1. Environment Variables (MANDATORY)**
```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here    # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here      # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                           # Default model selection
DEPLOYMENT_ENV=huggingface                    # Platform configuration
```

**Critical Rules:**
- βœ… **HF_TOKEN_LC**: For accessing private LinguaCustodia models
- βœ… **HF_TOKEN**: For HuggingFace Pro account features (endpoints, Spaces, etc.)
- βœ… **Always load from .env**: `from dotenv import load_dotenv; load_dotenv()`

### **2. Model Reloading (vLLM Limitation)**
```python
# vLLM does not support hot swaps - service restart required
# Solution: Implemented service restart mechanism via /load-model endpoint
# Process: Clear GPU memory β†’ Restart service β†’ Load new model
```

**Critical Rules:**
- ❌ **vLLM does not support hot swaps**
- βœ… **We need to reload because vLLM does not support hot swaps**
- βœ… **Service restart mechanism implemented for model switching**

### **3. OpenAI Standard Interface**
```python
# We expose OpenAI standard interface
# Endpoints: /v1/chat/completions, /v1/completions, /v1/models
# Full compatibility for easy integration
```

**Critical Rules:**
- βœ… **We expose OpenAI standard interface**
- βœ… **Full OpenAI API compatibility**
- βœ… **Standard endpoints for easy integration**
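
Calling the interface can be sketched with the standard library alone. The base URL is a placeholder, and the `build_chat_request`/`chat` helper names are illustrative, not part of the actual API code:

```python
# Minimal sketch of a /v1/chat/completions call using only the stdlib.
# API_BASE is a placeholder; the model alias follows the /load-model
# naming used elsewhere in this document (assumption).
import json
import urllib.request

API_BASE = "https://your-api-url.hf.space/v1"  # placeholder URL

def build_chat_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build a standard OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> dict:
    """POST the payload to the Space (network call; requires a live deployment)."""
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the payload is the standard OpenAI shape, any OpenAI-compatible client can point at this base URL instead.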

---

## 🚫 **NEVER DO THESE**

### **❌ Token Usage Mistakes**
1. **NEVER** use `HF_TOKEN` for LinguaCustodia model access (use `HF_TOKEN_LC`)
2. **NEVER** use `HF_TOKEN_LC` for HuggingFace Pro features (use `HF_TOKEN`)
3. **NEVER** hardcode tokens in code (always use environment variables)

### **❌ Model Loading Mistakes**
1. **NEVER** try to hot-swap models with vLLM (service restart required)
2. **NEVER** use 12B+ models on L40 GPU (memory allocation fails)
3. **NEVER** skip GPU memory cleanup during model switching

### **❌ Deployment Mistakes**
1. **NEVER** skip virtual environment activation
2. **NEVER** use global Python installations
3. **NEVER** forget to load environment variables from .env
4. **NEVER** attempt local implementation or testing (local machine is weak)

---

## βœ… **ALWAYS DO THESE**

### **βœ… Environment Setup**
```bash
# ALWAYS activate virtual environment first
cd /Users/jeanbapt/Dragon-fin && source venv/bin/activate
```

```python
# ALWAYS load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
```

### **βœ… Authentication**
```python
# ALWAYS use correct tokens for their purposes
hf_token_lc = os.getenv('HF_TOKEN_LC')  # For LinguaCustodia models
hf_token = os.getenv('HF_TOKEN')        # For HuggingFace Pro features

# ALWAYS authenticate before accessing models
from huggingface_hub import login
login(token=hf_token_lc)  # For model access
```

### **βœ… Model Configuration**
```python
# ALWAYS use these exact parameters for LinguaCustodia models
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token_lc,
    torch_dtype=torch.bfloat16,  # CONFIRMED: All models use bf16
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True
)
```

---

## πŸ“Š **Current Production Configuration**

### **βœ… Space Configuration**
- **Space URL**: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- **Hardware**: L40 GPU (48GB VRAM, $1.80/hour)
- **Backend**: vLLM (official v0.2.0+) with eager mode
- **Port**: 7860 (HuggingFace standard)
- **Status**: Fully operational with vLLM backend abstraction

### **βœ… API Endpoints**
- **Standard**: /, /health, /inference, /docs, /load-model, /models, /backend
- **OpenAI-compatible**: /v1/chat/completions, /v1/completions, /v1/models
- **Analytics**: /analytics/performance, /analytics/costs, /analytics/usage

### **βœ… Model Compatibility**
- **L40 GPU Compatible**: Llama 3.1 8B, Qwen 3 8B, Fin-Pythia 1.4B
- **L40 GPU Incompatible**: Gemma 3 12B, Llama 3.1 70B (too large)

### **βœ… Storage Strategy**
- **Persistent Storage**: `/data/.huggingface` (150GB)
- **Automatic Fallback**: `~/.cache/huggingface` if persistent unavailable
- **Cache Preservation**: Disk cache never cleared (only GPU memory)
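
The fallback can be sketched as a small startup check. The `resolve_hf_home` helper name is hypothetical; the paths match the strategy above:

```python
# Sketch of the persistent-storage fallback: prefer /data/.huggingface,
# fall back to the default user cache when /data is not mounted/writable.
import os

def resolve_hf_home(persistent: str = "/data/.huggingface") -> str:
    """Return the persistent path if usable, else the default HF cache."""
    parent = os.path.dirname(persistent)
    if os.path.isdir(parent) and os.access(parent, os.W_OK):
        return persistent
    return os.path.expanduser("~/.cache/huggingface")

# setdefault keeps an explicit HF_HOME from the Space settings intact
os.environ.setdefault("HF_HOME", resolve_hf_home())
```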

---

## πŸ”§ **Model Loading Rules**

### **βœ… Three-Tier Caching Strategy**
1. **First Load**: Downloads and caches to persistent storage
2. **Same Model**: Reuses loaded model in memory (instant)
3. **Model Switch**: Clears GPU memory, loads from disk cache

### **βœ… Memory Management**
```python
import gc
import torch

def cleanup_model_memory():
    """Free GPU memory before a model switch; the disk cache is preserved."""
    global pipe, model, tokenizer

    # Delete Python references to the model objects
    del pipe, model, tokenizer

    # Clear GPU cache
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

    # Force garbage collection
    gc.collect()

    # Disk cache PRESERVED for fast reloading
```

### **βœ… Model Switching Process**
1. **Clear GPU Memory**: Remove current model from GPU
2. **Service Restart**: Required for vLLM model switching
3. **Load New Model**: From disk cache or download
4. **Initialize vLLM Engine**: With new model configuration

---

## 🎯 **L40 GPU Limitations**

### **βœ… Compatible Models (Recommended)**
- **Llama 3.1 8B**: ~24GB total memory usage
- **Qwen 3 8B**: ~24GB total memory usage  
- **Fin-Pythia 1.4B**: ~6GB total memory usage

### **❌ Incompatible Models**
- **Gemma 3 12B**: ~45GB needed (exceeds 48GB L40 capacity)
- **Llama 3.1 70B**: ~80GB needed (exceeds 48GB L40 capacity)

### **πŸ” Memory Analysis**
```
8B Models (Working):
Model weights:        ~16GB βœ…
KV caches:           ~8GB  βœ…
Inference buffers:   ~4GB  βœ…
System overhead:     ~2GB  βœ…
Total used:          ~30GB (fits comfortably)

12B+ Models (Failing):
Model weights:        ~22GB βœ… (loads successfully)
KV caches:           ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:        ~48GB (exceeds L40 capacity)
```
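
A back-of-envelope check of the figures above: bf16 stores 2 bytes per parameter, so weights alone cost roughly 2 GB per billion parameters. The helper names are illustrative, and the KV/buffer/overhead defaults are the estimates from this section, not measured values:

```python
# Rough L40 fit check using the memory breakdown above.
def bf16_weight_gb(params_billion: float) -> float:
    """Approximate weight footprint in decimal GB for a bf16 model."""
    return params_billion * 2.0  # 1e9 params * 2 bytes = 2 GB

def fits_l40(params_billion: float, kv_gb: float,
             buffers_gb: float = 4.0, overhead_gb: float = 2.0,
             vram_gb: float = 48.0) -> bool:
    """True if the estimated total fits within L40 VRAM."""
    total = bf16_weight_gb(params_billion) + kv_gb + buffers_gb + overhead_gb
    return total <= vram_gb
```

With these estimates, an 8B model totals ~30 GB and fits, while a 12B model with larger KV caches and buffers does not.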

---

## πŸš€ **Deployment Rules**

### **βœ… HuggingFace Spaces**
- **Use Docker SDK**: With proper user setup (ID 1000)
- **Set hardware**: L40 GPU for optimal performance
- **Use port 7860**: HuggingFace standard
- **Include --chown=user**: For file permissions in Dockerfile
- **Set HF_HOME=/data/.huggingface**: For persistent storage
- **Use 150GB+ persistent storage**: For model caching

### **βœ… Environment Variables**
```bash
# Required in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```

### **βœ… Docker Configuration**
```dockerfile
# Use python -m uvicorn instead of uvicorn directly
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

# Include --chown=user for file permissions
COPY --chown=user:user . /app
```

---

## πŸ§ͺ **Testing Rules**

### **βœ… Always Test in This Order**
```bash
# 1. Test health endpoint
curl https://your-api-url.hf.space/health

# 2. Test model switching
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# 3. Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```

### **βœ… Cloud Development Only**
```bash
# ALWAYS use cloud platforms for testing and development
# Local machine is weak - no local implementation possible

# Test on HuggingFace Spaces or Scaleway instead
# Deploy to cloud platforms for all testing and development
```

---

## πŸ“ **File Organization Rules**

### **βœ… Required Files (Keep These)**
- `app.py` - Main production API (v24.1.0 hybrid architecture)
- `lingua_fin/` - Clean Pydantic package structure (local development)
- `utils/` - Utility scripts and tests
- `.env` - Contains HF_TOKEN_LC and HF_TOKEN
- `requirements.txt` - Production dependencies
- `Dockerfile` - Container configuration

### **βœ… Documentation Files**
- `README.md` - Main project documentation
- `docs/COMPREHENSIVE_DOCUMENTATION.md` - Complete unified documentation
- `docs/PROJECT_RULES.md` - This file (MANDATORY REFERENCE)
- `docs/L40_GPU_LIMITATIONS.md` - GPU compatibility guide

---

## 🚨 **Emergency Troubleshooting**

### **If Model Loading Fails:**
1. Check if `.env` file has `HF_TOKEN_LC`
2. Verify virtual environment is activated
3. Check if model is compatible with L40 GPU
4. Verify GPU memory availability
5. Try smaller model first
6. **Remember: No local testing - use cloud platforms only**

### **If Authentication Fails:**
1. Check `HF_TOKEN_LC` in `.env` file
2. Verify token has access to LinguaCustodia organization
3. Try re-authenticating with `login(token=hf_token_lc)`

### **If Space Deployment Fails:**
1. Check HF Space settings for required secrets
2. Verify hardware configuration (L40 GPU)
3. Check Dockerfile for proper user setup
4. Verify port configuration (7860)

---

## πŸ“ **Quick Reference Commands**

```bash
# Activate environment (ALWAYS FIRST)
source venv/bin/activate

# Test Space health
curl https://your-api-url.hf.space/health

# Switch to Qwen model
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Your question here", "max_new_tokens": 100}'
```

---

## 🎯 **REMEMBER: These are the GOLDEN RULES - NEVER CHANGE**

1. βœ… **.env contains all keys and secrets**
2. βœ… **HF_TOKEN_LC is for pulling models from LinguaCustodia**
3. βœ… **HF_TOKEN is for HF repo access and Pro features**
4. βœ… **We need to reload because vLLM does not support hot swaps**
5. βœ… **We expose OpenAI standard interface**
6. βœ… **No local implementation - local machine is weak, use cloud platforms only**

**This document is the single source of truth for project rules!** πŸ“š
