---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# LinguaCustodia Financial AI API

A production-ready FastAPI application for financial AI inference using LinguaCustodia models.

## Features

- **Multiple Models**: Support for Llama 3.1, Qwen 3, Gemma 3, and Fin-Pythia models
- **FastAPI**: High-performance API with automatic documentation
- **Persistent Storage**: Models cached for faster restarts
- **GPU Support**: Automatic GPU detection and optimization
- **Health Monitoring**: Built-in health checks and diagnostics

## API Endpoints

- `GET /` - API information and status
- `GET /health` - Health check with model and GPU status
- `GET /models` - List available models and configurations
- `POST /inference` - Run inference with the loaded model
- `GET /docs` - Interactive API documentation
- `GET /diagnose-imports` - Diagnose import issues

## Usage

### Inference Request

Call the Space's direct URL (Spaces expose APIs at `https://<owner>-<space-name>.hf.space`, not at the `huggingface.co/spaces/...` page URL):

```bash
curl -X POST "https://jeanbaptdzd-linguacustodia-financial-api.hf.space/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is SFCR in insurance regulation?",
    "max_new_tokens": 150,
    "temperature": 0.6
  }'
```

### Response

```json
{
  "response": "SFCR (Solvency and Financial Condition Report) is a regulatory requirement...",
  "model_used": "LinguaCustodia/llama3.1-8b-fin-v0.3",
  "success": true,
  "tokens_generated": 45,
  "generation_params": {
    "max_new_tokens": 150,
    "temperature": 0.6,
    "eos_token_id": [128001, 128008, 128009],
    "early_stopping": false,
    "min_length": 50
  }
}
```
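The same request from Python, as a minimal sketch using the `requests` library. The `*.hf.space` base URL is assumed from the standard Spaces URL scheme, and the field names follow the request/response shapes shown above:

```python
import requests

# Direct URL of the Space (assumed from the standard *.hf.space scheme).
BASE_URL = "https://jeanbaptdzd-linguacustodia-financial-api.hf.space"

payload = {
    "prompt": "What is SFCR in insurance regulation?",
    "max_new_tokens": 150,
    "temperature": 0.6,
}

resp = requests.post(f"{BASE_URL}/inference", json=payload, timeout=120)
resp.raise_for_status()

data = resp.json()
if data.get("success"):
    print(data["response"])
    print(f'({data["tokens_generated"]} tokens from {data["model_used"]})')
```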

## Environment Variables

The following environment variables need to be set in the Space settings:

- `HF_TOKEN_LC`: HuggingFace token for LinguaCustodia models (required)
- `MODEL_NAME`: Model to use (default: `llama3.1-8b`)
- `APP_PORT`: Application port (default: `7860`)
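Inside the application these would typically be read with `os.getenv`; a minimal sketch mirroring the names and defaults listed above (the exact loading code in this repo may differ):

```python
import os

# HF_TOKEN_LC is required, so fail fast if it is missing.
HF_TOKEN_LC = os.environ["HF_TOKEN_LC"]  # raises KeyError if unset
MODEL_NAME = os.getenv("MODEL_NAME", "llama3.1-8b")
APP_PORT = int(os.getenv("APP_PORT", "7860"))
```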

## Available Models

### ✅ L40 GPU Compatible Models

- `llama3.1-8b`: Llama 3.1 8B Financial (16GB RAM, 8GB VRAM) - ✅ Recommended
- `qwen3-8b`: Qwen 3 8B Financial (16GB RAM, 8GB VRAM) - ✅ Recommended
- `fin-pythia-1.4b`: Fin-Pythia 1.4B Financial (3GB RAM, 2GB VRAM) - ✅ Works

### ❌ L40 GPU Incompatible Models

- `gemma3-12b`: Gemma 3 12B Financial (32GB RAM, 12GB VRAM) - ❌ Too large for L40
- `llama3.1-70b`: Llama 3.1 70B Financial (140GB RAM, 80GB VRAM) - ❌ Too large for L40

**⚠️ Important**: Gemma 3 12B and Llama 3.1 70B are too large for the L40 GPU (48GB VRAM) with vLLM; they fail during KV cache initialization. Use the 8B models for optimal performance.
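A rough back-of-envelope check, counting weights only at bf16 (2 bytes per parameter); vLLM must additionally fit the KV cache and activations into whatever headroom remains, which is why even the 12B model fails on the L40:

```python
L40_VRAM_GB = 48

# Approximate parameter counts (billions) for the models listed above.
models = {
    "fin-pythia-1.4b": 1.4,
    "llama3.1-8b": 8,
    "qwen3-8b": 8,
    "gemma3-12b": 12,
    "llama3.1-70b": 70,
}

for name, params_b in models.items():
    weights_gb = params_b * 2  # bf16: 2 bytes per parameter
    headroom = L40_VRAM_GB - weights_gb
    print(f"{name}: ~{weights_gb:.0f}GB weights, {headroom:.0f}GB headroom on L40")
```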

## Architecture

This API uses a hybrid architecture that works in both local development and cloud deployment environments:

- **Clean Architecture**: Uses Pydantic models and proper separation of concerns
- **Embedded Fallback**: Falls back to an embedded configuration when imports fail (see the sketch below)
- **Persistent Storage**: Models are cached in persistent storage for faster restarts
- **GPU Optimization**: Automatic GPU detection and memory management
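The embedded-fallback pattern looks roughly like the following. The module path and structure are illustrative, not the actual names in this repo; only the `LinguaCustodia/llama3.1-8b-fin-v0.3` repo id comes from the response example above:

```python
# Prefer the packaged configuration module; fall back to an inline copy
# when the import fails (e.g. a different file layout in the Space runtime).
try:
    from app.config import MODEL_CONFIGS  # hypothetical module path
except ImportError:
    # Embedded fallback: a minimal inline configuration.
    MODEL_CONFIGS = {
        "llama3.1-8b": {
            "repo_id": "LinguaCustodia/llama3.1-8b-fin-v0.3",
            "max_new_tokens": 150,
        },
    }
```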

## Development

For local development, see the main README.md file.

## License

MIT License - see LICENSE file for details.