# LinguaCustodia Financial AI API - Comprehensive Documentation

**Version**: 24.1.0
**Last Updated**: October 6, 2025
**Status**: ✅ Production Ready

---

## 📋 Table of Contents

1. [Project Overview](#project-overview)
2. [Architecture](#architecture)
3. [Golden Rules](#golden-rules)
4. [Model Compatibility](#model-compatibility)
5. [API Reference](#api-reference)
6. [Deployment Guide](#deployment-guide)
7. [Performance & Analytics](#performance--analytics)
8. [Troubleshooting](#troubleshooting)
9. [Development History](#development-history)
10. [Best Practices](#best-practices)
11. [Support & Resources](#support--resources)

---

## 🎯 Project Overview

The LinguaCustodia Financial AI API is a production-ready FastAPI application that provides financial AI inference using specialized LinguaCustodia models. It features dynamic model switching, OpenAI-compatible endpoints, and optimized performance for both HuggingFace Spaces and cloud deployments.

### **Key Features**

- ✅ **Multiple Models**: Llama 3.1, Qwen 3, Gemma 3, Fin-Pythia
- ✅ **Dynamic Model Switching**: Runtime model loading via API
- ✅ **OpenAI Compatibility**: Standard `/v1/chat/completions` interface
- ✅ **vLLM Backend**: High-performance inference engine
- ✅ **Analytics**: Performance monitoring and cost tracking
- ✅ **Multi-Platform**: HuggingFace Spaces, Scaleway, Koyeb support

### **Current Deployment**

- **Space URL**: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- **Hardware**: L40 GPU (48GB VRAM)
- **Status**: Fully operational with vLLM backend
- **Current Model**: Qwen 3 8B Financial (recommended for L40)

---

## 🏗️ Architecture

### **Backend Abstraction Layer**

The application uses a platform-specific backend abstraction that automatically selects optimal configurations:

```python
class InferenceBackend:
    """Unified interface for all inference backends."""
```

Two implementations are provided:

- `VLLMBackend`: high-performance vLLM engine (primary)
- `TransformersBackend`: fallback for compatibility

### **Platform-Specific Configurations**

#### **HuggingFace Spaces (L40 GPU - 48GB VRAM)**

```python
VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,      # Conservative (36GB of 48GB)
    "max_model_len": 2048,               # HF-optimized
    "enforce_eager": True,               # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,   # No custom kernels
    "dtype": "bfloat16",
}
```

#### **Scaleway L40S (48GB VRAM)**

```python
VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,      # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,               # Full context length
    "enforce_eager": False,              # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
```

### **Model Loading Strategy**

Three-tier caching system:

1. **First Load**: Downloads and caches to persistent storage
2. **Same Model**: Reuses loaded model in memory (instant)
3. **Model Switch**: Clears GPU memory, loads from disk cache

---

## 🔑 Golden Rules

### **1. Environment Variables (MANDATORY)**

```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here   # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here     # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                          # Default model selection
DEPLOYMENT_ENV=huggingface                   # Platform configuration
```

### **2. Token Usage Rules**

- **HF_TOKEN_LC**: For accessing private LinguaCustodia models
- **HF_TOKEN**: For HuggingFace Pro account features (endpoints, Spaces, etc.)
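To make Golden Rules 1 and 2 concrete, here is a minimal startup sketch showing how the `.env` values could drive model and platform selection. It assumes `python-dotenv`, assumes the platform dictionaries from the Architecture section are passed straight through to `vllm.LLM`, and uses illustrative names (`MODEL_REGISTRY`, `build_engine`) plus an assumed `DEPLOYMENT_ENV=scaleway` value; it is a sketch of the idea, not the application's actual code.

```python
import os

from dotenv import load_dotenv  # python-dotenv
from vllm import LLM

# Copied from the Architecture section above.
VLLM_CONFIG_HF = {"gpu_memory_utilization": 0.75, "max_model_len": 2048,
                  "enforce_eager": True, "disable_custom_all_reduce": True,
                  "dtype": "bfloat16"}
VLLM_CONFIG_SCW = {"gpu_memory_utilization": 0.85, "max_model_len": 4096,
                   "enforce_eager": False, "disable_custom_all_reduce": False,
                   "dtype": "bfloat16"}

# Same short-name -> repo mapping that /models returns.
MODEL_REGISTRY = {
    "llama3.1-8b": "LinguaCustodia/llama3.1-8b-fin-v0.3",
    "qwen3-8b": "LinguaCustodia/qwen3-8b-fin-v0.3",
    "fin-pythia-1.4b": "LinguaCustodia/fin-pythia-1.4b",
}


def build_engine() -> LLM:
    load_dotenv()  # Golden Rule 1: everything comes from .env

    # Golden Rule 2: HF_TOKEN_LC is the token with access to the private
    # LinguaCustodia repos. huggingface_hub (used by vLLM for downloads)
    # reads HF_TOKEN, so expose the LinguaCustodia token under that name
    # for the model download. (Assumption: the real app may route tokens
    # differently.)
    os.environ["HF_TOKEN"] = os.environ["HF_TOKEN_LC"]

    model_id = MODEL_REGISTRY[os.environ.get("MODEL_NAME", "qwen3-8b")]
    # "scaleway" as a DEPLOYMENT_ENV value is an assumption; the docs only
    # show DEPLOYMENT_ENV=huggingface explicitly.
    is_scaleway = os.environ.get("DEPLOYMENT_ENV", "huggingface") == "scaleway"
    config = VLLM_CONFIG_SCW if is_scaleway else VLLM_CONFIG_HF
    return LLM(model=model_id, **config)
```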
### **3. Model Reloading (vLLM Limitation)**

- **vLLM does not support hot-swapping models** - a service restart is required to switch models
- **Solution**: Implemented a service restart mechanism via the `/load-model` endpoint
- **Process**: Clear GPU memory → Restart service → Load new model

### **4. OpenAI Standard Interface**

- **Exposed**: `/v1/chat/completions`, `/v1/completions`, `/v1/models`
- **Compatibility**: Full OpenAI API compatibility for easy integration
- **Context Management**: Automatic chat formatting and context handling

---

## 📊 Model Compatibility

### **✅ L40 GPU Compatible Models (Recommended)**

| Model | Parameters | VRAM Used | Status | Best For |
|-------|------------|-----------|--------|----------|
| **Llama 3.1 8B** | 8B | ~24GB | ✅ **Recommended** | Development |
| **Qwen 3 8B** | 8B | ~24GB | ✅ **Recommended** | Alternative 8B |
| **Fin-Pythia 1.4B** | 1.4B | ~6GB | ✅ Works | Quick testing |

### **❌ L40 GPU Incompatible Models**

| Model | Parameters | VRAM Needed | Issue |
|-------|------------|-------------|-------|
| **Gemma 3 12B** | 12B | ~45GB | ❌ **Too large** - KV cache allocation fails |
| **Llama 3.1 70B** | 70B | ~80GB | ❌ **Too large** - Exceeds L40 capacity |

### **Memory Analysis**

**Why 12B+ Models Fail on L40:**

```
Model weights:      ~22GB ✅ (loads successfully)
KV caches:          ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB ❌ (allocation fails)
System overhead:     ~3GB ❌ (allocation fails)
Total needed:       ~48GB (exceeds the usable portion of the L40's 48GB)
```

**Why 8B Models Succeed:**

```
Model weights:      ~16GB ✅
KV caches:           ~8GB ✅
Inference buffers:   ~4GB ✅
System overhead:     ~2GB ✅
Total used:         ~30GB (fits comfortably)
```

---

## 🔧 API Reference

### **Standard Endpoints**

#### **Health Check**

```bash
GET /health
```

**Response:**

```json
{
  "status": "healthy",
  "model_loaded": true,
  "current_model": "LinguaCustodia/qwen3-8b-fin-v0.3",
  "architecture": "Inline Configuration (HF Optimized) + VLLM",
  "gpu_available": true
}
```

#### **List Models**

```bash
GET /models
```

**Response:**

```json
{
  "current_model": "qwen3-8b",
  "available_models": {
    "llama3.1-8b": "LinguaCustodia/llama3.1-8b-fin-v0.3",
    "qwen3-8b": "LinguaCustodia/qwen3-8b-fin-v0.3",
    "fin-pythia-1.4b": "LinguaCustodia/fin-pythia-1.4b"
  }
}
```

#### **Model Switching**

```bash
POST /load-model?model_name=qwen3-8b
```

**Response:**

```json
{
  "message": "Model 'qwen3-8b' loading started",
  "model_name": "qwen3-8b",
  "display_name": "Qwen 3 8B Financial",
  "status": "loading_started",
  "backend_type": "vllm"
}
```

#### **Inference**

```bash
POST /inference
Content-Type: application/json

{
  "prompt": "What is SFCR in insurance regulation?",
  "max_new_tokens": 150,
  "temperature": 0.6
}
```

### **OpenAI-Compatible Endpoints**

#### **Chat Completions**

```bash
POST /v1/chat/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "messages": [
    {"role": "user", "content": "What is Basel III?"}
  ],
  "max_tokens": 150,
  "temperature": 0.6
}
```

#### **Text Completions**

```bash
POST /v1/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "prompt": "What is Basel III?",
  "max_tokens": 150,
  "temperature": 0.6
}
```
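Because these endpoints follow the OpenAI specification, existing OpenAI SDKs work against the API once the base URL is overridden. Below is a minimal sketch using the official `openai` Python package (v1+); the base URL is a placeholder, and the `api_key` value is a dummy (how keys are enforced depends on your deployment).

```python
from openai import OpenAI

# Point the standard OpenAI client at the LinguaCustodia API.
# The base URL is a placeholder - substitute your Space or server URL.
client = OpenAI(
    base_url="https://your-api-url.hf.space/v1",
    api_key="not-used",  # placeholder; key handling depends on the deployment
)

response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "What is Basel III?"}],
    max_tokens=150,
    temperature=0.6,
)

print(response.choices[0].message.content)
```

The same pattern applies to `client.completions.create(...)` for the `/v1/completions` endpoint.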
### **Analytics Endpoints**

#### **Performance Analytics**

```bash
GET /analytics/performance
```

#### **Cost Analytics**

```bash
GET /analytics/costs
```

#### **Usage Analytics**

```bash
GET /analytics/usage
```

---

## 🚀 Deployment Guide

### **HuggingFace Spaces Deployment**

#### **Requirements**

- Dockerfile with `git` installed
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=huggingface`, `USE_VLLM=true`
- Hardware: L40 GPU (48GB VRAM) - Pro account required

#### **Configuration**

```yaml
# README.md frontmatter
---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---
```

#### **Environment Variables**

```bash
# Required secrets in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```

#### **Storage Configuration**

- **Persistent Storage**: 150GB+ recommended
- **Cache Location**: `/data/.huggingface`
- **Automatic Fallback**: `~/.cache/huggingface` if persistent storage is unavailable (see the sketch below)
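The automatic fallback can be implemented with a writability check before the first model download. The sketch below is illustrative rather than the application's actual logic; `resolve_hf_cache` is an assumed helper name.

```python
import os
from pathlib import Path


def resolve_hf_cache(preferred: str = "/data/.huggingface") -> str:
    """Return a usable HF cache dir, preferring persistent storage."""
    persistent_root = Path(preferred).parent
    if persistent_root.is_dir() and os.access(persistent_root, os.W_OK):
        Path(preferred).mkdir(parents=True, exist_ok=True)
        return preferred
    # Fallback: the default huggingface_hub location in the home directory.
    return str(Path.home() / ".cache" / "huggingface")


# Export before any huggingface_hub / vLLM import so downloads land in the right place.
os.environ["HF_HOME"] = resolve_hf_cache()
```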
### **Local Development**

#### **Setup**

```bash
# Clone repository
git clone <repository-url>
cd Dragon-fin

# Create virtual environment
python -m venv venv
source venv/bin/activate     # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Load environment variables
cp env.example .env
# Edit .env with your tokens

# Run application
python app.py
```

#### **Testing**

```bash
# Test health endpoint
curl http://localhost:8000/health

# Test inference
curl -X POST http://localhost:8000/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```

---

## 📈 Performance & Analytics

### **Performance Metrics**

#### **HuggingFace Spaces (L40 GPU)**

- **GPU Memory**: 36GB utilized (75% of 48GB)
- **Model Load Time**: ~27 seconds
- **Inference Speed**: Fast with eager mode (conservative, no CUDA graphs)
- **Concurrent Requests**: Optimized batching
- **Configuration**: `enforce_eager=True` for stability

#### **Scaleway L40S (Dedicated GPU)**

- **GPU Memory**: 40.1GB utilized (~84% of 48GB)
- **Model Load Time**: ~30 seconds
- **Inference Speed**: 20-30% faster with CUDA graphs
- **Concurrent Requests**: 37.36x maximum concurrency (4K-token requests)
- **Response Times**: ~0.37s for simple queries, ~3.5s for complex queries
- **Configuration**: `enforce_eager=False` with CUDA graphs enabled

#### **CUDA Graphs Optimization (Scaleway)**

- **Graph Capture**: 67 mixed prefill-decode + 35 decode graphs
- **Memory Overhead**: 0.85 GiB for graph optimization
- **Performance Gain**: 20-30% faster inference
- **Verification**: Look for "Graph capturing finished" in logs
- **Configuration**: `enforce_eager=False` + `disable_custom_all_reduce=False`

#### **Model Switch Performance**

- **Memory Cleanup**: ~2-3 seconds
- **Loading from Cache**: ~25 seconds
- **Total Switch Time**: ~28 seconds

### **Analytics Features**

#### **Performance Monitoring**

- GPU utilization tracking
- Memory usage monitoring
- Request latency metrics
- Throughput statistics

#### **Cost Tracking**

- Token-based pricing
- Hardware cost calculation
- Usage analytics
- Cost optimization recommendations

#### **Usage Analytics**

- Request patterns
- Model usage statistics
- Error rate monitoring
- Performance trends

---

## 🔧 Troubleshooting

### **Common Issues**

#### **1. Model Loading Failures**

- **Issue**: `EngineCore failed to start` during KV cache initialization
- **Cause**: Model too large for available GPU memory
- **Solution**: Use 8B models instead of 12B+ models on the L40 GPU

#### **2. Authentication Errors**

- **Issue**: `401 Unauthorized` when accessing models
- **Cause**: Incorrect or missing `HF_TOKEN_LC`
- **Solution**: Verify the token in the `.env` file and HF Space settings

#### **3. Memory Issues**

- **Issue**: OOM errors during inference
- **Cause**: Insufficient GPU memory
- **Solution**: Reduce `gpu_memory_utilization` or use a smaller model

#### **4. Module Import Errors**

- **Issue**: `ModuleNotFoundError` in HuggingFace Spaces
- **Cause**: Containerized environment module resolution
- **Solution**: Use the inline configuration pattern (already implemented)

### **Debug Commands**

#### **Check Space Status**

```bash
curl https://your-api-url.hf.space/health
```

#### **Test Model Switching**

```bash
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"
```

#### **Monitor Loading Progress**

```bash
curl https://your-api-url.hf.space/loading-status
```
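These calls can be combined into a small client-side script that triggers a switch and waits for the restarted service to report the new model. This is a sketch using `requests`; the polling interval and timeout are arbitrary, and because the `/loading-status` payload is not documented here, the script polls the documented `/health` fields instead.

```python
import time

import requests

BASE_URL = "https://your-api-url.hf.space"  # placeholder


def switch_model(model_name: str, timeout_s: int = 600) -> dict:
    """Trigger a model switch and poll /health until the new model is loaded."""
    resp = requests.post(f"{BASE_URL}/load-model", params={"model_name": model_name})
    resp.raise_for_status()
    print(resp.json()["message"])  # e.g. "Model 'qwen3-8b' loading started"

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            health = requests.get(f"{BASE_URL}/health", timeout=10).json()
            # /health reports the fully-qualified repo id once loading finishes.
            if health.get("model_loaded") and model_name in health.get("current_model", ""):
                return health
        except requests.RequestException:
            pass  # the service may be restarting (vLLM requires a restart to switch)
        time.sleep(10)
    raise TimeoutError(f"Model '{model_name}' did not finish loading in {timeout_s}s")


if __name__ == "__main__":
    print(switch_model("qwen3-8b"))
```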
---

## 📚 Development History

### **Version Evolution**

#### **v24.1.0 (Current) - Production Ready**

- ✅ vLLM backend integration
- ✅ OpenAI-compatible endpoints
- ✅ Dynamic model switching
- ✅ Analytics and monitoring
- ✅ L40 GPU optimization
- ✅ Comprehensive error handling

#### **v22.1.0 - Hybrid Architecture**

- ✅ Inline configuration pattern
- ✅ HuggingFace Spaces compatibility
- ✅ Model switching via service restart
- ✅ Persistent storage integration

#### **v20.1.0 - Backend Abstraction**

- ✅ Platform-specific configurations
- ✅ HuggingFace/Scaleway support
- ✅ vLLM integration
- ✅ Performance optimizations

### **Key Milestones**

1. **Initial Development**: Basic FastAPI with Transformers backend
2. **Model Integration**: LinguaCustodia model support
3. **Deployment**: HuggingFace Spaces integration
4. **Performance**: vLLM backend implementation
5. **Compatibility**: OpenAI API standard compliance
6. **Analytics**: Performance monitoring and cost tracking
7. **Optimization**: L40 GPU specific configurations

### **Lessons Learned**

1. **HuggingFace Spaces module resolution** differs from local development
2. **Inline configuration** is more reliable for cloud deployments
3. **vLLM requires a service restart** for model switching
4. **8B models are optimal** for the L40 GPU (48GB VRAM)
5. **Persistent storage** dramatically improves model loading times
6. **OpenAI compatibility** enables easy integration with existing tools

---

## 🎯 Best Practices

### **Model Selection**

- **Use 8B models** for L40 GPU deployments
- **Test locally first** before deploying to production
- **Monitor memory usage** during model switching

### **Performance Optimization**

- **Enable persistent storage** for faster model loading
- **Use appropriate GPU memory utilization** (75% for HF, 85% for Scaleway)
- **Monitor analytics** for performance insights

### **Security**

- **Keep tokens secure** in environment variables
- **Use private endpoints** for sensitive models
- **Implement rate limiting** for production deployments

### **Maintenance**

- **Regular health checks** via the `/health` endpoint
- **Monitor error rates** and performance metrics
- **Update dependencies** regularly for security

---

## 📞 Support & Resources

### **Documentation**

- [HuggingFace Spaces Guide](https://huggingface.co/docs/hub/spaces)
- [vLLM Documentation](https://docs.vllm.ai/)
- [LinguaCustodia Models](https://huggingface.co/LinguaCustodia)

### **API Testing**

- **Interactive Docs**: https://your-api-url.hf.space/docs
- **Health Check**: https://your-api-url.hf.space/health
- **Model List**: https://your-api-url.hf.space/models

### **Contact**

- **Issues**: Report via GitHub issues
- **Questions**: Check the documentation first, then create an issue
- **Contributions**: Follow project guidelines

---

**This documentation represents the complete, unified knowledge base for the LinguaCustodia Financial AI API project.**