# LinguaCustodia Project Rules & Guidelines

Version: 24.1.0
Last Updated: October 6, 2025
Status: ✅ Production Ready
## 🏆 GOLDEN RULES - NEVER CHANGE

### 1. Environment Variables (MANDATORY)

```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here   # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here     # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                          # Default model selection
DEPLOYMENT_ENV=huggingface                   # Platform configuration
```
Critical Rules:
- ✅ `HF_TOKEN_LC`: for accessing private LinguaCustodia models
- ✅ `HF_TOKEN`: for HuggingFace Pro account features (endpoints, Spaces, etc.)
- ✅ Always load from `.env`: `from dotenv import load_dotenv; load_dotenv()`
### 2. Model Reloading (vLLM Limitation)

```python
# vLLM does not support hot swaps - a service restart is required.
# Solution: service restart mechanism via the /load-model endpoint.
# Process: clear GPU memory → restart service → load new model
```
Critical Rules:
- ✅ vLLM does not support hot swaps
- ✅ We must restart the service to switch models, because vLLM cannot hot-swap
- ✅ Service restart mechanism implemented for model switching
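Because a switch is a full restart, clients should expect `/health` to go dark and come back. A minimal client-side sketch of the switch-and-wait flow (the base URL is a placeholder; the endpoint names follow the rules above):

```python
import time
import urllib.error
import urllib.request

def load_model_url(base_url: str, model_name: str) -> str:
    """Build the /load-model URL that triggers a restart-based switch."""
    return f"{base_url.rstrip('/')}/load-model?model_name={model_name}"

def switch_model(base_url: str, model_name: str, timeout_s: int = 600) -> bool:
    """POST /load-model, then poll /health until the restarted service answers."""
    req = urllib.request.Request(load_model_url(base_url, model_name), method="POST")
    urllib.request.urlopen(req, timeout=30)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url.rstrip('/')}/health", timeout=10) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service is restarting; keep polling
        time.sleep(5)
    return False

# Usage against the live Space:
#   switch_model("https://your-api-url.hf.space", "qwen3-8b")
```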
### 3. OpenAI Standard Interface

```python
# We expose the OpenAI standard interface.
# Endpoints: /v1/chat/completions, /v1/completions, /v1/models
# Full compatibility for easy integration.
```
Critical Rules:
- ✅ We expose the OpenAI standard interface
- ✅ Full OpenAI API compatibility
- ✅ Standard endpoints for easy integration
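Because the interface is OpenAI-standard, the official `openai` Python client can talk to the Space just by overriding `base_url`. A sketch of the request shape (the URL is a placeholder):

```python
def chat_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Request body shape accepted by /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# With the official client (pip install openai):
#   from openai import OpenAI
#   client = OpenAI(base_url="https://your-api-url.hf.space/v1", api_key="unused")
#   resp = client.chat.completions.create(**chat_payload("qwen3-8b", "What is SFCR?"))
#   print(resp.choices[0].message.content)
```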
## 🚫 NEVER DO THESE

### ❌ Token Usage Mistakes
- NEVER use `HF_TOKEN` for LinguaCustodia model access (use `HF_TOKEN_LC`)
- NEVER use `HF_TOKEN_LC` for HuggingFace Pro features (use `HF_TOKEN`)
- NEVER hardcode tokens in code (always use environment variables)
### ❌ Model Loading Mistakes
- NEVER try to hot-swap models with vLLM (service restart required)
- NEVER use 12B+ models on L40 GPU (memory allocation fails)
- NEVER skip GPU memory cleanup during model switching
### ❌ Deployment Mistakes
- NEVER skip virtual environment activation
- NEVER use global Python installations
- NEVER forget to load environment variables from .env
- NEVER attempt local implementation or testing (local machine is weak)
## ✅ ALWAYS DO THESE

### ✅ Environment Setup

```bash
# ALWAYS activate virtual environment first
cd /Users/jeanbapt/Dragon-fin && source venv/bin/activate
```

```python
# ALWAYS load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
```
### ✅ Authentication

```python
import os

# ALWAYS use correct tokens for their purposes
hf_token_lc = os.getenv('HF_TOKEN_LC')  # For LinguaCustodia models
hf_token = os.getenv('HF_TOKEN')        # For HuggingFace Pro features

# ALWAYS authenticate before accessing models
from huggingface_hub import login
login(token=hf_token_lc)  # For model access
```
### ✅ Model Configuration

```python
import torch
from transformers import AutoModelForCausalLM

# ALWAYS use these exact parameters for LinguaCustodia models
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token_lc,
    torch_dtype=torch.bfloat16,  # CONFIRMED: all models use bf16
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
```
## 📋 Current Production Configuration

### ✅ Space Configuration
- Space URL: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- Hardware: L40 GPU (48GB VRAM, $1.80/hour)
- Backend: vLLM (official v0.2.0+) with eager mode
- Port: 7860 (HuggingFace standard)
- Status: fully operational with vLLM backend abstraction
### ✅ API Endpoints
- Standard: /, /health, /inference, /docs, /load-model, /models, /backend
- OpenAI-compatible: /v1/chat/completions, /v1/completions, /v1/models
- Analytics: /analytics/performance, /analytics/costs, /analytics/usage
### ✅ Model Compatibility
- L40 GPU Compatible: Llama 3.1 8B, Qwen 3 8B, Fin-Pythia 1.4B
- L40 GPU Incompatible: Gemma 3 12B, Llama 3.1 70B (too large)
### ✅ Storage Strategy
- Persistent Storage: `/data/.huggingface` (150GB)
- Automatic Fallback: `~/.cache/huggingface` if persistent storage is unavailable
- Cache Preservation: the disk cache is never cleared (only GPU memory)
## 🔧 Model Loading Rules

### ✅ Three-Tier Caching Strategy
- First Load: downloads and caches to persistent storage
- Same Model: reuses the loaded model in memory (instant)
- Model Switch: clears GPU memory, loads from disk cache
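The three tiers reduce to one decision function; a sketch (names are illustrative, not the actual app.py API):

```python
from typing import Optional, Set

def load_action(requested: str, loaded: Optional[str], disk_cache: Set[str]) -> str:
    """Map a model request onto the three-tier caching strategy."""
    if requested == loaded:
        return "reuse-in-memory"          # same model: instant
    if requested in disk_cache:
        return "restart-load-from-disk"   # switch: GPU cleared, disk cache hit
    return "download-cache-then-load"     # first load: download and cache
```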
### ✅ Memory Management

```python
import gc
import torch

def cleanup_model_memory():
    """Free GPU memory before a model switch; the disk cache is untouched."""
    global pipe, model, tokenizer
    # Delete Python objects holding GPU tensors
    del pipe, model, tokenizer
    # Clear GPU cache
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # Force garbage collection
    gc.collect()
    # Disk cache PRESERVED for fast reloading
```
### ✅ Model Switching Process
1. Clear GPU Memory: remove the current model from the GPU
2. Service Restart: required for vLLM model switching
3. Load New Model: from disk cache, or download if absent
4. Initialize vLLM Engine: with the new model configuration
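Server-side, the restart step can be done by re-exec'ing the process so vLLM boots fresh with the new model. A sketch, assuming `MODEL_NAME` is read on startup as the env rules above describe:

```python
import os
import sys

def restart_argv() -> list:
    """Command used to re-exec the current service process."""
    return [sys.executable] + sys.argv

def restart_with_model(model_name: str) -> None:
    """Record the next model, then replace this process with a fresh one.
    GPU cleanup (cleanup_model_memory) should run before calling this."""
    os.environ["MODEL_NAME"] = model_name  # read again on startup
    os.execv(sys.executable, restart_argv())
```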
## 🎯 L40 GPU Limitations

### ✅ Compatible Models (Recommended)
- Llama 3.1 8B: ~24GB total memory usage
- Qwen 3 8B: ~24GB total memory usage
- Fin-Pythia 1.4B: ~6GB total memory usage
### ❌ Incompatible Models
- Gemma 3 12B: ~45GB needed (exceeds the L40's usable 48GB capacity once runtime overhead is included)
- Llama 3.1 70B: ~80GB needed (exceeds 48GB L40 capacity)
### 📊 Memory Analysis

```
8B Models (Working):
  Model weights:     ~16GB  ✅
  KV caches:          ~8GB  ✅
  Inference buffers:  ~4GB  ✅
  System overhead:    ~2GB  ✅
  Total used:        ~30GB  (fits comfortably)

12B+ Models (Failing):
  Model weights:     ~22GB  ✅ (loads successfully)
  KV caches:         ~15GB  ❌ (allocation fails)
  Inference buffers:  ~8GB  ❌ (allocation fails)
  System overhead:    ~3GB  ❌ (allocation fails)
  Total needed:      ~48GB  (exceeds usable L40 capacity)
```
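The totals above follow from bf16 weights at 2 bytes per parameter plus runtime overheads. A rough go/no-go calculator using the estimates from this section (the numbers are the approximations above, not measurements):

```python
def bf16_weights_gb(params_billions: float) -> float:
    """bf16 stores 2 bytes per parameter, so an 8B model needs ~16GB of weights."""
    return params_billions * 2.0

def fits_l40(params_billions: float, kv_gb: float, buffers_gb: float,
             overhead_gb: float, capacity_gb: float = 48.0) -> bool:
    """Rough check of total footprint against the L40's 48GB VRAM."""
    total = bf16_weights_gb(params_billions) + kv_gb + buffers_gb + overhead_gb
    return total < capacity_gb

# 8B profile from the analysis above: 16 + 8 + 4 + 2 = 30GB -> fits
# 12B profile: ~24 + 15 + 8 + 3 = 50GB -> does not fit
```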
## 🚀 Deployment Rules

### ✅ HuggingFace Spaces
- Use Docker SDK: With proper user setup (ID 1000)
- Set hardware: L40 GPU for optimal performance
- Use port 7860: HuggingFace standard
- Include --chown=user: For file permissions in Dockerfile
- Set HF_HOME=/data/.huggingface: For persistent storage
- Use 150GB+ persistent storage: For model caching
### ✅ Environment Variables

```bash
# Required in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```
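A startup check lets the Space fail fast when a required secret is missing from the settings; a minimal sketch (the helper name is illustrative):

```python
import os

REQUIRED_ENV = ("HF_TOKEN_LC", "HF_TOKEN", "MODEL_NAME", "DEPLOYMENT_ENV", "HF_HOME")

def missing_env(required=REQUIRED_ENV) -> list:
    """Return the required variables that are unset or empty."""
    return [name for name in required if not os.getenv(name)]

# At startup:
#   missing = missing_env()
#   if missing:
#       raise RuntimeError(f"Missing required environment variables: {missing}")
```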
### ✅ Docker Configuration

```dockerfile
# Include --chown=user for file permissions
COPY --chown=user:user . /app

# Use python -m uvicorn instead of calling uvicorn directly
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```
## 🧪 Testing Rules

### ✅ Always Test in This Order

```bash
# 1. Test health endpoint
curl https://your-api-url.hf.space/health

# 2. Test model switching
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# 3. Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```
### ✅ Cloud Development Only

```bash
# ALWAYS use cloud platforms for testing and development.
# The local machine is too weak for local implementation or testing.
# Test and deploy on HuggingFace Spaces or Scaleway instead.
```
## 📁 File Organization Rules

### ✅ Required Files (Keep These)
- `app.py` - main production API (v24.1.0 hybrid architecture)
- `lingua_fin/` - clean Pydantic package structure (local development)
- `utils/` - utility scripts and tests
- `.env` - contains HF_TOKEN_LC and HF_TOKEN
- `requirements.txt` - production dependencies
- `Dockerfile` - container configuration

### ✅ Documentation Files
- `README.md` - main project documentation
- `docs/COMPREHENSIVE_DOCUMENTATION.md` - complete unified documentation
- `docs/PROJECT_RULES.md` - this file (MANDATORY REFERENCE)
- `docs/L40_GPU_LIMITATIONS.md` - GPU compatibility guide
## 🚨 Emergency Troubleshooting

### If Model Loading Fails:
- Check that the `.env` file has `HF_TOKEN_LC`
- Verify the virtual environment is activated
- Check that the model is compatible with the L40 GPU
- Verify GPU memory availability
- Try a smaller model first
- Remember: no local testing - use cloud platforms only

### If Authentication Fails:
- Check `HF_TOKEN_LC` in the `.env` file
- Verify the token has access to the LinguaCustodia organization
- Try re-authenticating with `login(token=hf_token_lc)`

### If Space Deployment Fails:
- Check HF Space settings for the required secrets
- Verify the hardware configuration (L40 GPU)
- Check the Dockerfile for proper user setup
- Verify the port configuration (7860)
## 📋 Quick Reference Commands

```bash
# Activate environment (ALWAYS FIRST)
source venv/bin/activate

# Test Space health
curl https://your-api-url.hf.space/health

# Switch to Qwen model
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

# Test inference
curl -X POST "https://your-api-url.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Your question here", "max_new_tokens": 100}'
```
## 🎯 REMEMBER: These are the GOLDEN RULES - NEVER CHANGE
- ✅ `.env` contains all keys and secrets
- ✅ `HF_TOKEN_LC` is for pulling models from LinguaCustodia
- ✅ `HF_TOKEN` is for HF repo access and Pro features
- ✅ We need to reload because vLLM does not support hot swaps
- ✅ We expose the OpenAI standard interface
- ✅ No local implementation - the local machine is weak; use cloud platforms only
This document is the single source of truth for project rules! 🚀