---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# LinguaCustodia Financial AI API

A production-ready FastAPI application for financial AI inference using LinguaCustodia models.

## Features

- **Multiple Models**: Support for Llama 3.1, Qwen 3, Gemma 3, and Fin-Pythia models
- **FastAPI**: High-performance API with automatic documentation
- **Persistent Storage**: Models cached for faster restarts
- **GPU Support**: Automatic GPU detection and optimization
- **Health Monitoring**: Built-in health checks and diagnostics

## API Endpoints

- `GET /` - API information and status
- `GET /health` - Health check with model and GPU status
- `GET /models` - List available models and configurations
- `POST /inference` - Run inference with the loaded model
- `GET /docs` - Interactive API documentation
- `GET /diagnose-imports` - Diagnose import issues

## Usage

### Inference Request

```bash
curl -X POST "https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is SFCR in insurance regulation?",
    "max_new_tokens": 150,
    "temperature": 0.6
  }'
```

### Response

```json
{
  "response": "SFCR (Solvency and Financial Condition Report) is a regulatory requirement...",
  "model_used": "LinguaCustodia/llama3.1-8b-fin-v0.3",
  "success": true,
  "tokens_generated": 45,
  "generation_params": {
    "max_new_tokens": 150,
    "temperature": 0.6,
    "eos_token_id": [128001, 128008, 128009],
    "early_stopping": false,
    "min_length": 50
  }
}
```

## Environment Variables

The following environment variables need to be set in the Space settings:

- `HF_TOKEN_LC`: HuggingFace token for LinguaCustodia models (required)
- `MODEL_NAME`: Model to use (default: `llama3.1-8b`)
- `APP_PORT`: Application port (default: `7860`)

## Models Available

### ✅ **L40 GPU Compatible Models**

- **llama3.1-8b**: Llama 3.1 8B Financial (16GB RAM, 8GB VRAM) - ✅ **Recommended**
- **qwen3-8b**: Qwen 3 8B Financial (16GB RAM, 8GB VRAM) - ✅ **Recommended**
- **fin-pythia-1.4b**: Fin-Pythia 1.4B Financial (3GB RAM, 2GB VRAM) - ✅ Works

### ❌ **L40 GPU Incompatible Models**

- **gemma3-12b**: Gemma 3 12B Financial (32GB RAM, 12GB VRAM) - ❌ **Too large for L40**
- **llama3.1-70b**: Llama 3.1 70B Financial (140GB RAM, 80GB VRAM) - ❌ **Too large for L40**

**⚠️ Important**: The Gemma 3 12B and Llama 3.1 70B models are too large for an L40 GPU (48GB VRAM) with vLLM; they fail during KV cache initialization. Use the 8B models for optimal performance.

## Architecture

This API uses a hybrid architecture that works in both local development and cloud deployment environments:

- **Clean Architecture**: Uses Pydantic models and proper separation of concerns
- **Embedded Fallback**: Falls back to an embedded configuration when imports fail
- **Persistent Storage**: Models are cached in persistent storage for faster restarts
- **GPU Optimization**: Automatic GPU detection and memory management

## Development

For local development, see the main [README.md](README.md) file.

## License

MIT License - see LICENSE file for details.
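## Example: Python Client

The `/inference` endpoint shown in the curl example can also be called from Python. The sketch below uses only the standard library; the URL is the one from the Usage section above and may differ for your own deployment, and the request/response field names follow the JSON examples in this README.

```python
import json
import urllib.request

# URL from the curl example above; adjust for your own Space deployment.
API_URL = "https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api/inference"


def build_request(prompt: str,
                  max_new_tokens: int = 150,
                  temperature: float = 0.6) -> urllib.request.Request:
    """Build a POST request matching the /inference JSON schema."""
    payload = json.dumps({
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, **params) -> str:
    """Send the request and return the generated text, raising on failure."""
    with urllib.request.urlopen(build_request(prompt, **params), timeout=120) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    if not body.get("success"):
        raise RuntimeError(f"Inference failed: {body}")
    return body["response"]


# Example usage (requires network access to the running Space):
#   print(ask("What is SFCR in insurance regulation?", max_new_tokens=150))
```

The generation parameters (`max_new_tokens`, `temperature`) mirror the request body documented above; any parameters the API exposes beyond those two are not covered here.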