---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# LinguaCustodia Financial AI API

A production-ready FastAPI application for financial AI inference using LinguaCustodia models.

## Features

- **Multiple Models**: Support for Llama 3.1, Qwen 3, Gemma 3, and Fin-Pythia models
- **FastAPI**: High-performance API with automatic documentation
- **Persistent Storage**: Models cached for faster restarts
- **GPU Support**: Automatic GPU detection and optimization
- **Health Monitoring**: Built-in health checks and diagnostics

## API Endpoints

- `GET /` - API information and status
- `GET /health` - Health check with model and GPU status
- `GET /models` - List available models and configurations
- `POST /inference` - Run inference with the loaded model
- `GET /docs` - Interactive API documentation
- `GET /diagnose-imports` - Diagnose import issues

## Usage

### Inference Request

Call the Space's direct URL (Spaces expose APIs at `https://<owner>-<space-name>.hf.space`, not at the `huggingface.co/spaces/...` page URL):

```bash
curl -X POST "https://jeanbaptdzd-linguacustodia-financial-api.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is SFCR in insurance regulation?",
    "max_new_tokens": 150,
    "temperature": 0.6
  }'
```

### Response

```json
{
  "response": "SFCR (Solvency and Financial Condition Report) is a regulatory requirement...",
  "model_used": "LinguaCustodia/llama3.1-8b-fin-v0.3",
  "success": true,
  "tokens_generated": 45,
  "generation_params": {
    "max_new_tokens": 150,
    "temperature": 0.6,
    "eos_token_id": [128001, 128008, 128009],
    "early_stopping": false,
    "min_length": 50
  }
}
```
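The same request from Python, as a minimal sketch using the `requests` library. The `*.hf.space` base URL is assumed from the standard Spaces URL scheme, and the field names follow the request/response shapes shown above:

```python
import requests

# Direct URL of the Space (assumed from the standard *.hf.space scheme).
BASE_URL = "https://jeanbaptdzd-linguacustodia-financial-api.hf.space"

payload = {
    "prompt": "What is SFCR in insurance regulation?",
    "max_new_tokens": 150,
    "temperature": 0.6,
}

resp = requests.post(f"{BASE_URL}/inference", json=payload, timeout=120)
resp.raise_for_status()

data = resp.json()
if data.get("success"):
    print(data["response"])
    print(f'({data["tokens_generated"]} tokens from {data["model_used"]})')
```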

## Environment Variables

The following environment variables need to be set in the Space settings:

- `HF_TOKEN_LC`: HuggingFace token for LinguaCustodia models (required)
- `MODEL_NAME`: Model to use (default: `llama3.1-8b`)
- `APP_PORT`: Application port (default: `7860`)
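Inside the application these would typically be read with `os.getenv`; a minimal sketch mirroring the names and defaults listed above (the exact loading code in this repo may differ):

```python
import os

# HF_TOKEN_LC is required, so fail fast if it is missing.
HF_TOKEN_LC = os.environ["HF_TOKEN_LC"]  # raises KeyError if unset
MODEL_NAME = os.getenv("MODEL_NAME", "llama3.1-8b")
APP_PORT = int(os.getenv("APP_PORT", "7860"))
```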

## Available Models

### ✅ L40 GPU Compatible Models

- `llama3.1-8b`: Llama 3.1 8B Financial (16GB RAM, 8GB VRAM) - ✅ Recommended
- `qwen3-8b`: Qwen 3 8B Financial (16GB RAM, 8GB VRAM) - ✅ Recommended
- `fin-pythia-1.4b`: Fin-Pythia 1.4B Financial (3GB RAM, 2GB VRAM) - ✅ Works

### ❌ L40 GPU Incompatible Models

- `gemma3-12b`: Gemma 3 12B Financial (32GB RAM, 12GB VRAM) - ❌ Too large for L40
- `llama3.1-70b`: Llama 3.1 70B Financial (140GB RAM, 80GB VRAM) - ❌ Too large for L40

**⚠️ Important**: Gemma 3 12B and Llama 3.1 70B are too large for the L40 GPU (48GB VRAM) with vLLM; they fail during KV cache initialization. Use the 8B models for optimal performance.
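A rough back-of-envelope check, counting weights only at bf16 (2 bytes per parameter); vLLM must additionally fit the KV cache and activations into whatever headroom remains, which is why even the 12B model fails on the L40:

```python
L40_VRAM_GB = 48

# Approximate parameter counts (billions) for the models listed above.
models = {
    "fin-pythia-1.4b": 1.4,
    "llama3.1-8b": 8,
    "qwen3-8b": 8,
    "gemma3-12b": 12,
    "llama3.1-70b": 70,
}

for name, params_b in models.items():
    weights_gb = params_b * 2  # bf16: 2 bytes per parameter
    headroom = L40_VRAM_GB - weights_gb
    print(f"{name}: ~{weights_gb:.0f}GB weights, {headroom:.0f}GB headroom on L40")
```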

## Architecture

This API uses a hybrid architecture that works in both local development and cloud deployment environments:

- **Clean Architecture**: Uses Pydantic models and proper separation of concerns
- **Embedded Fallback**: Falls back to an embedded configuration when imports fail (see the sketch below)
- **Persistent Storage**: Models are cached in persistent storage for faster restarts
- **GPU Optimization**: Automatic GPU detection and memory management
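The embedded-fallback pattern looks roughly like the following. The module path and structure are illustrative, not the actual names in this repo; only the `LinguaCustodia/llama3.1-8b-fin-v0.3` repo id comes from the response example above:

```python
# Prefer the packaged configuration module; fall back to an inline copy
# when the import fails (e.g. a different file layout in the Space runtime).
try:
    from app.config import MODEL_CONFIGS  # hypothetical module path
except ImportError:
    # Embedded fallback: a minimal inline configuration.
    MODEL_CONFIGS = {
        "llama3.1-8b": {
            "repo_id": "LinguaCustodia/llama3.1-8b-fin-v0.3",
            "max_new_tokens": 150,
        },
    }
```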

## Development

For local development, see the main README.md file.

## License

MIT License - see LICENSE file for details.