# LinguaCustodia Financial AI API - Comprehensive Documentation

**Version**: 24.1.0
**Last Updated**: October 6, 2025
**Status**: ✅ Production Ready

---

## 📋 Table of Contents

1. [Project Overview](#project-overview)
2. [Architecture](#architecture)
3. [Golden Rules](#golden-rules)
4. [Model Compatibility](#model-compatibility)
5. [API Reference](#api-reference)
6. [Deployment Guide](#deployment-guide)
7. [Performance & Analytics](#performance--analytics)
8. [Troubleshooting](#troubleshooting)
9. [Development History](#development-history)
10. [Best Practices](#best-practices)
11. [Support & Resources](#support--resources)

---

## 🎯 Project Overview

The LinguaCustodia Financial AI API is a production-ready FastAPI application that provides financial AI inference using specialized LinguaCustodia models. It features dynamic model switching, OpenAI-compatible endpoints, and optimized performance for both HuggingFace Spaces and cloud deployments.

### **Key Features**

- ✅ **Multiple Models**: Llama 3.1, Qwen 3, Gemma 3, Fin-Pythia
- ✅ **Dynamic Model Switching**: Runtime model loading via API
- ✅ **OpenAI Compatibility**: Standard `/v1/chat/completions` interface
- ✅ **vLLM Backend**: High-performance inference engine
- ✅ **Analytics**: Performance monitoring and cost tracking
- ✅ **Multi-Platform**: HuggingFace Spaces, Scaleway, Koyeb support

### **Current Deployment**

- **Space URL**: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- **Hardware**: L40 GPU (48GB VRAM)
- **Status**: Fully operational with vLLM backend
- **Current Model**: Qwen 3 8B Financial (recommended for L40)

---

## 🏗️ Architecture

### **Backend Abstraction Layer**

The application uses a platform-specific backend abstraction that automatically selects optimal configurations:

```python
class InferenceBackend:
    """Unified interface for all inference backends."""
```

Two implementations are provided:

- `VLLMBackend`: high-performance vLLM engine (primary)
- `TransformersBackend`: fallback for compatibility

### **Platform-Specific Configurations**

#### **HuggingFace Spaces (L40 GPU - 48GB VRAM)**

```python
VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,      # Conservative (36GB of 48GB)
    "max_model_len": 2048,               # HF-optimized
    "enforce_eager": True,               # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,   # No custom kernels
    "dtype": "bfloat16",
}
```

#### **Scaleway L40S (48GB VRAM)**

```python
VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,      # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,               # Full context length
    "enforce_eager": False,              # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
```

### **Model Loading Strategy**

Three-tier caching system:

1. **First Load**: Downloads and caches to persistent storage
2. **Same Model**: Reuses loaded model in memory (instant)
3. **Model Switch**: Clears GPU memory, loads from disk cache

---

## 🔑 Golden Rules

### **1. Environment Variables (MANDATORY)**

```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here   # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here     # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                          # Default model selection
DEPLOYMENT_ENV=huggingface                   # Platform configuration
```

### **2. Token Usage Rules**

- **HF_TOKEN_LC**: For accessing private LinguaCustodia models
- **HF_TOKEN**: For HuggingFace Pro account features (endpoints, Spaces, etc.)
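To make Golden Rules 1 and 2 concrete, here is a minimal startup sketch showing how the `.env` values could drive model and platform selection. It assumes `python-dotenv`, assumes the platform dictionaries from the Architecture section are passed straight through to `vllm.LLM`, and uses illustrative names (`MODEL_REGISTRY`, `build_engine`) plus an assumed `DEPLOYMENT_ENV=scaleway` value; it is a sketch of the idea, not the application's actual code.

```python
import os

from dotenv import load_dotenv  # python-dotenv
from vllm import LLM

# Copied from the Architecture section above.
VLLM_CONFIG_HF = {"gpu_memory_utilization": 0.75, "max_model_len": 2048,
                  "enforce_eager": True, "disable_custom_all_reduce": True,
                  "dtype": "bfloat16"}
VLLM_CONFIG_SCW = {"gpu_memory_utilization": 0.85, "max_model_len": 4096,
                   "enforce_eager": False, "disable_custom_all_reduce": False,
                   "dtype": "bfloat16"}

# Same short-name -> repo mapping that /models returns.
MODEL_REGISTRY = {
    "llama3.1-8b": "LinguaCustodia/llama3.1-8b-fin-v0.3",
    "qwen3-8b": "LinguaCustodia/qwen3-8b-fin-v0.3",
    "fin-pythia-1.4b": "LinguaCustodia/fin-pythia-1.4b",
}


def build_engine() -> LLM:
    load_dotenv()  # Golden Rule 1: everything comes from .env

    # Golden Rule 2: HF_TOKEN_LC is the token with access to the private
    # LinguaCustodia repos. huggingface_hub (used by vLLM for downloads)
    # reads HF_TOKEN, so expose the LinguaCustodia token under that name
    # for the model download. (Assumption: the real app may route tokens
    # differently.)
    os.environ["HF_TOKEN"] = os.environ["HF_TOKEN_LC"]

    model_id = MODEL_REGISTRY[os.environ.get("MODEL_NAME", "qwen3-8b")]
    # "scaleway" as a DEPLOYMENT_ENV value is an assumption; the docs only
    # show DEPLOYMENT_ENV=huggingface explicitly.
    is_scaleway = os.environ.get("DEPLOYMENT_ENV", "huggingface") == "scaleway"
    config = VLLM_CONFIG_SCW if is_scaleway else VLLM_CONFIG_HF
    return LLM(model=model_id, **config)
```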
### **3. Model Reloading (vLLM Limitation)**

- **vLLM does not support hot-swapping models** - a service restart is required to switch models
- **Solution**: Implemented a service restart mechanism via the `/load-model` endpoint
- **Process**: Clear GPU memory → Restart service → Load new model

### **4. OpenAI Standard Interface**

- **Exposed**: `/v1/chat/completions`, `/v1/completions`, `/v1/models`
- **Compatibility**: Full OpenAI API compatibility for easy integration
- **Context Management**: Automatic chat formatting and context handling

---

## 📊 Model Compatibility

### **✅ L40 GPU Compatible Models (Recommended)**

| Model | Parameters | VRAM Used | Status | Best For |
|-------|------------|-----------|--------|----------|
| **Llama 3.1 8B** | 8B | ~24GB | ✅ **Recommended** | Development |
| **Qwen 3 8B** | 8B | ~24GB | ✅ **Recommended** | Alternative 8B |
| **Fin-Pythia 1.4B** | 1.4B | ~6GB | ✅ Works | Quick testing |

### **❌ L40 GPU Incompatible Models**

| Model | Parameters | VRAM Needed | Issue |
|-------|------------|-------------|-------|
| **Gemma 3 12B** | 12B | ~45GB | ❌ **Too large** - KV cache allocation fails |
| **Llama 3.1 70B** | 70B | ~80GB | ❌ **Too large** - Exceeds L40 capacity |

### **Memory Analysis**

**Why 12B+ Models Fail on L40:**

```
Model weights:      ~22GB ✅ (loads successfully)
KV caches:          ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB ❌ (allocation fails)
System overhead:     ~3GB ❌ (allocation fails)
Total needed:       ~48GB (exceeds the usable portion of the L40's 48GB)
```

**Why 8B Models Succeed:**

```
Model weights:      ~16GB ✅
KV caches:           ~8GB ✅
Inference buffers:   ~4GB ✅
System overhead:     ~2GB ✅
Total used:         ~30GB (fits comfortably)
```

---

## 🔧 API Reference

### **Standard Endpoints**

#### **Health Check**

```bash
GET /health
```

**Response:**

```json
{
  "status": "healthy",
  "model_loaded": true,
  "current_model": "LinguaCustodia/qwen3-8b-fin-v0.3",
  "architecture": "Inline Configuration (HF Optimized) + VLLM",
  "gpu_available": true
}
```

#### **List Models**

```bash
GET /models
```

**Response:**

```json
{
  "current_model": "qwen3-8b",
  "available_models": {
    "llama3.1-8b": "LinguaCustodia/llama3.1-8b-fin-v0.3",
    "qwen3-8b": "LinguaCustodia/qwen3-8b-fin-v0.3",
    "fin-pythia-1.4b": "LinguaCustodia/fin-pythia-1.4b"
  }
}
```

#### **Model Switching**

```bash
POST /load-model?model_name=qwen3-8b
```

**Response:**

```json
{
  "message": "Model 'qwen3-8b' loading started",
  "model_name": "qwen3-8b",
  "display_name": "Qwen 3 8B Financial",
  "status": "loading_started",
  "backend_type": "vllm"
}
```

#### **Inference**

```bash
POST /inference
Content-Type: application/json

{
  "prompt": "What is SFCR in insurance regulation?",
  "max_new_tokens": 150,
  "temperature": 0.6
}
```

### **OpenAI-Compatible Endpoints**

#### **Chat Completions**

```bash
POST /v1/chat/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "messages": [
    {"role": "user", "content": "What is Basel III?"}
  ],
  "max_tokens": 150,
  "temperature": 0.6
}
```

#### **Text Completions**

```bash
POST /v1/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "prompt": "What is Basel III?",
  "max_tokens": 150,
  "temperature": 0.6
}
```
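Because these endpoints follow the OpenAI specification, existing OpenAI SDKs work against the API once the base URL is overridden. Below is a minimal sketch using the official `openai` Python package (v1+); the base URL is a placeholder, and the `api_key` value is a dummy (how keys are enforced depends on your deployment).

```python
from openai import OpenAI

# Point the standard OpenAI client at the LinguaCustodia API.
# The base URL is a placeholder - substitute your Space or server URL.
client = OpenAI(
    base_url="https://your-api-url.hf.space/v1",
    api_key="not-used",  # placeholder; key handling depends on the deployment
)

response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "What is Basel III?"}],
    max_tokens=150,
    temperature=0.6,
)

print(response.choices[0].message.content)
```

The same pattern applies to `client.completions.create(...)` for the `/v1/completions` endpoint.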
### **Analytics Endpoints**

#### **Performance Analytics**

```bash
GET /analytics/performance
```

#### **Cost Analytics**

```bash
GET /analytics/costs
```

#### **Usage Analytics**

```bash
GET /analytics/usage
```

---

## 🚀 Deployment Guide

### **HuggingFace Spaces Deployment**

#### **Requirements**

- Dockerfile with `git` installed
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=huggingface`, `USE_VLLM=true`
- Hardware: L40 GPU (48GB VRAM) - Pro account required

#### **Configuration**

```yaml
# README.md frontmatter
---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---
```

#### **Environment Variables**

```bash
# Required secrets in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```

#### **Storage Configuration**

- **Persistent Storage**: 150GB+ recommended
- **Cache Location**: `/data/.huggingface`
- **Automatic Fallback**: `~/.cache/huggingface` if persistent storage is unavailable (see the sketch below)
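The automatic fallback can be implemented with a writability check before the first model download. The sketch below is illustrative rather than the application's actual logic; `resolve_hf_cache` is an assumed helper name.

```python
import os
from pathlib import Path


def resolve_hf_cache(preferred: str = "/data/.huggingface") -> str:
    """Return a usable HF cache dir, preferring persistent storage."""
    persistent_root = Path(preferred).parent
    if persistent_root.is_dir() and os.access(persistent_root, os.W_OK):
        Path(preferred).mkdir(parents=True, exist_ok=True)
        return preferred
    # Fallback: the default huggingface_hub location in the home directory.
    return str(Path.home() / ".cache" / "huggingface")


# Export before any huggingface_hub / vLLM import so downloads land in the right place.
os.environ["HF_HOME"] = resolve_hf_cache()
```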
### **Local Development**

#### **Setup**

```bash
# Clone repository
git clone <repository-url>
cd Dragon-fin

# Create virtual environment
python -m venv venv
source venv/bin/activate     # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Load environment variables
cp env.example .env
# Edit .env with your tokens

# Run application
python app.py
```

#### **Testing**

```bash
# Test health endpoint
curl http://localhost:8000/health

# Test inference
curl -X POST http://localhost:8000/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```

---

## 📈 Performance & Analytics

### **Performance Metrics**

#### **HuggingFace Spaces (L40 GPU)**

- **GPU Memory**: 36GB utilized (75% of 48GB)
- **Model Load Time**: ~27 seconds
- **Inference Speed**: Fast with eager mode (conservative, no CUDA graphs)
- **Concurrent Requests**: Optimized batching
- **Configuration**: `enforce_eager=True` for stability

#### **Scaleway L40S (Dedicated GPU)**

- **GPU Memory**: 40.1GB utilized (~84% of 48GB)
- **Model Load Time**: ~30 seconds
- **Inference Speed**: 20-30% faster with CUDA graphs
- **Concurrent Requests**: 37.36x maximum concurrency (4K-token requests)
- **Response Times**: ~0.37s for simple queries, ~3.5s for complex queries
- **Configuration**: `enforce_eager=False` with CUDA graphs enabled

#### **CUDA Graphs Optimization (Scaleway)**

- **Graph Capture**: 67 mixed prefill-decode + 35 decode graphs
- **Memory Overhead**: 0.85 GiB for graph optimization
- **Performance Gain**: 20-30% faster inference
- **Verification**: Look for "Graph capturing finished" in logs
- **Configuration**: `enforce_eager=False` + `disable_custom_all_reduce=False`

#### **Model Switch Performance**

- **Memory Cleanup**: ~2-3 seconds
- **Loading from Cache**: ~25 seconds
- **Total Switch Time**: ~28 seconds

### **Analytics Features**

#### **Performance Monitoring**

- GPU utilization tracking
- Memory usage monitoring
- Request latency metrics
- Throughput statistics

#### **Cost Tracking**

- Token-based pricing
- Hardware cost calculation
- Usage analytics
- Cost optimization recommendations

#### **Usage Analytics**

- Request patterns
- Model usage statistics
- Error rate monitoring
- Performance trends

---

## 🔧 Troubleshooting

### **Common Issues**

#### **1. Model Loading Failures**

- **Issue**: `EngineCore failed to start` during KV cache initialization
- **Cause**: Model too large for available GPU memory
- **Solution**: Use 8B models instead of 12B+ models on the L40 GPU

#### **2. Authentication Errors**

- **Issue**: `401 Unauthorized` when accessing models
- **Cause**: Incorrect or missing `HF_TOKEN_LC`
- **Solution**: Verify the token in the `.env` file and HF Space settings

#### **3. Memory Issues**

- **Issue**: OOM errors during inference
- **Cause**: Insufficient GPU memory
- **Solution**: Reduce `gpu_memory_utilization` or use a smaller model

#### **4. Module Import Errors**

- **Issue**: `ModuleNotFoundError` in HuggingFace Spaces
- **Cause**: Containerized environment module resolution
- **Solution**: Use the inline configuration pattern (already implemented)

### **Debug Commands**

#### **Check Space Status**

```bash
curl https://your-api-url.hf.space/health
```

#### **Test Model Switching**

```bash
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"
```

#### **Monitor Loading Progress**

```bash
curl https://your-api-url.hf.space/loading-status
```
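These calls can be combined into a small client-side script that triggers a switch and waits for the restarted service to report the new model. This is a sketch using `requests`; the polling interval and timeout are arbitrary, and because the `/loading-status` payload is not documented here, the script polls the documented `/health` fields instead.

```python
import time

import requests

BASE_URL = "https://your-api-url.hf.space"  # placeholder


def switch_model(model_name: str, timeout_s: int = 600) -> dict:
    """Trigger a model switch and poll /health until the new model is loaded."""
    resp = requests.post(f"{BASE_URL}/load-model", params={"model_name": model_name})
    resp.raise_for_status()
    print(resp.json()["message"])  # e.g. "Model 'qwen3-8b' loading started"

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            health = requests.get(f"{BASE_URL}/health", timeout=10).json()
            # /health reports the fully-qualified repo id once loading finishes.
            if health.get("model_loaded") and model_name in health.get("current_model", ""):
                return health
        except requests.RequestException:
            pass  # the service may be restarting (vLLM requires a restart to switch)
        time.sleep(10)
    raise TimeoutError(f"Model '{model_name}' did not finish loading in {timeout_s}s")


if __name__ == "__main__":
    print(switch_model("qwen3-8b"))
```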
---

## 📚 Development History

### **Version Evolution**

#### **v24.1.0 (Current) - Production Ready**

- ✅ vLLM backend integration
- ✅ OpenAI-compatible endpoints
- ✅ Dynamic model switching
- ✅ Analytics and monitoring
- ✅ L40 GPU optimization
- ✅ Comprehensive error handling

#### **v22.1.0 - Hybrid Architecture**

- ✅ Inline configuration pattern
- ✅ HuggingFace Spaces compatibility
- ✅ Model switching via service restart
- ✅ Persistent storage integration

#### **v20.1.0 - Backend Abstraction**

- ✅ Platform-specific configurations
- ✅ HuggingFace/Scaleway support
- ✅ vLLM integration
- ✅ Performance optimizations

### **Key Milestones**

1. **Initial Development**: Basic FastAPI with Transformers backend
2. **Model Integration**: LinguaCustodia model support
3. **Deployment**: HuggingFace Spaces integration
4. **Performance**: vLLM backend implementation
5. **Compatibility**: OpenAI API standard compliance
6. **Analytics**: Performance monitoring and cost tracking
7. **Optimization**: L40 GPU specific configurations

### **Lessons Learned**

1. **HuggingFace Spaces module resolution** differs from local development
2. **Inline configuration** is more reliable for cloud deployments
3. **vLLM requires a service restart** for model switching
4. **8B models are optimal** for the L40 GPU (48GB VRAM)
5. **Persistent storage** dramatically improves model loading times
6. **OpenAI compatibility** enables easy integration with existing tools

---

## 🎯 Best Practices

### **Model Selection**

- **Use 8B models** for L40 GPU deployments
- **Test locally first** before deploying to production
- **Monitor memory usage** during model switching

### **Performance Optimization**

- **Enable persistent storage** for faster model loading
- **Use appropriate GPU memory utilization** (75% for HF, 85% for Scaleway)
- **Monitor analytics** for performance insights

### **Security**

- **Keep tokens secure** in environment variables
- **Use private endpoints** for sensitive models
- **Implement rate limiting** for production deployments

### **Maintenance**

- **Regular health checks** via the `/health` endpoint
- **Monitor error rates** and performance metrics
- **Update dependencies** regularly for security

---

## 📞 Support & Resources

### **Documentation**

- [HuggingFace Spaces Guide](https://huggingface.co/docs/hub/spaces)
- [vLLM Documentation](https://docs.vllm.ai/)
- [LinguaCustodia Models](https://huggingface.co/LinguaCustodia)

### **API Testing**

- **Interactive Docs**: https://your-api-url.hf.space/docs
- **Health Check**: https://your-api-url.hf.space/health
- **Model List**: https://your-api-url.hf.space/models

### **Contact**

- **Issues**: Report via GitHub issues
- **Questions**: Check the documentation first, then create an issue
- **Contributions**: Follow project guidelines

---

**This documentation represents the complete, unified knowledge base for the LinguaCustodia Financial AI API project.**