---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# LinguaCustodia Financial AI API

A production-ready FastAPI application for financial AI inference using LinguaCustodia models.

## Features

- **Multiple Models**: Support for Llama 3.1, Qwen 3, Gemma 3, and Fin-Pythia models
- **FastAPI**: High-performance API with automatic documentation
- **Persistent Storage**: Models cached for faster restarts
- **GPU Support**: Automatic GPU detection and optimization
- **Health Monitoring**: Built-in health checks and diagnostics

## API Endpoints

- `GET /` - API information and status
- `GET /health` - Health check with model and GPU status
- `GET /models` - List available models and configurations
- `POST /inference` - Run inference with the loaded model
- `GET /docs` - Interactive API documentation
- `GET /diagnose-imports` - Diagnose import issues
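
Before sending inference requests, you can sanity-check the deployment by polling `/health` and `/models`. The snippet below is a minimal sketch: it assumes the Space is reachable at its default `hf.space` URL and that the `requests` package is installed; the exact fields returned depend on the API's response schemas.

```python
import requests

BASE_URL = "https://jeanbaptdzd-linguacustodia-financial-api.hf.space"

# Check that the service is up and a model is loaded
health = requests.get(f"{BASE_URL}/health", timeout=30)
health.raise_for_status()
print(health.json())   # model load status, GPU availability, etc.

# List the models and configurations the API knows about
models = requests.get(f"{BASE_URL}/models", timeout=30)
models.raise_for_status()
print(models.json())
```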

## Usage

### Inference Request

```bash
curl -X POST "https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is SFCR in insurance regulation?",
    "max_new_tokens": 150,
    "temperature": 0.6
  }'
```

### Response

```json
{
  "response": "SFCR (Solvency and Financial Condition Report) is a regulatory requirement...",
  "model_used": "LinguaCustodia/llama3.1-8b-fin-v0.3",
  "success": true,
  "tokens_generated": 45,
  "generation_params": {
    "max_new_tokens": 150,
    "temperature": 0.6,
    "eos_token_id": [128001, 128008, 128009],
    "early_stopping": false,
    "min_length": 50
  }
}
```
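
### Python Client

The same request can be made from Python. This is a minimal sketch, assuming the Space's default `hf.space` URL and the `requests` package; the payload and response fields mirror the examples above.

```python
import requests

BASE_URL = "https://jeanbaptdzd-linguacustodia-financial-api.hf.space"

payload = {
    "prompt": "What is SFCR in insurance regulation?",
    "max_new_tokens": 150,
    "temperature": 0.6,
}

# The first request after a restart may be slow while the model loads,
# so use a generous timeout.
resp = requests.post(f"{BASE_URL}/inference", json=payload, timeout=120)
resp.raise_for_status()
data = resp.json()

if data.get("success"):
    print(data["response"])
    print(f"model: {data['model_used']}, tokens: {data['tokens_generated']}")
```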

## Environment Variables

The following environment variables are configured in the Space settings (only `HF_TOKEN_LC` is strictly required; the others have defaults):

- `HF_TOKEN_LC`: HuggingFace token for LinguaCustodia models (required)
- `MODEL_NAME`: Model to use (default: "llama3.1-8b")
- `APP_PORT`: Application port (default: 7860)
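
For illustration, these settings can be read as shown below; the variable names mirror the list above, but the application's actual configuration code may differ.

```python
import os

# Hypothetical settings loader; defaults mirror the values documented above.
HF_TOKEN_LC = os.environ["HF_TOKEN_LC"]              # required; raises KeyError if unset
MODEL_NAME = os.getenv("MODEL_NAME", "llama3.1-8b")  # which model to load
APP_PORT = int(os.getenv("APP_PORT", "7860"))        # port the API listens on
```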

## Models Available

### βœ… **L40 GPU Compatible Models**
- **llama3.1-8b**: Llama 3.1 8B Financial (16GB RAM, 8GB VRAM) - βœ… **Recommended**
- **qwen3-8b**: Qwen 3 8B Financial (16GB RAM, 8GB VRAM) - βœ… **Recommended**
- **fin-pythia-1.4b**: Fin-Pythia 1.4B Financial (3GB RAM, 2GB VRAM) - βœ… Works

### ❌ **L40 GPU Incompatible Models**
- **gemma3-12b**: Gemma 3 12B Financial (32GB RAM, 12GB VRAM) - ❌ **Too large for L40**
- **llama3.1-70b**: Llama 3.1 70B Financial (140GB RAM, 80GB VRAM) - ❌ **Too large for L40**

**⚠️ Important**: Gemma 3 12B and Llama 3.1 70B models are too large for L40 GPU (48GB VRAM) with vLLM. They will fail during KV cache initialization. Use 8B models for optimal performance.
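
When setting `MODEL_NAME`, it helps to validate the choice against the compatibility notes above before anything is loaded. The sketch below is illustrative only: the registry mirrors this section, and the application's internal model configuration may be structured differently.

```python
import os

# Illustrative registry mirroring the compatibility notes above.
MODELS = {
    "llama3.1-8b":     {"vram_gb": 8,  "l40_compatible": True},
    "qwen3-8b":        {"vram_gb": 8,  "l40_compatible": True},
    "fin-pythia-1.4b": {"vram_gb": 2,  "l40_compatible": True},
    "gemma3-12b":      {"vram_gb": 12, "l40_compatible": False},
    "llama3.1-70b":    {"vram_gb": 80, "l40_compatible": False},
}

model_name = os.getenv("MODEL_NAME", "llama3.1-8b")
if not MODELS.get(model_name, {}).get("l40_compatible", False):
    raise RuntimeError(
        f"{model_name} does not fit on an L40 GPU with vLLM; use an 8B or smaller model."
    )
```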

## Architecture

This API uses a hybrid architecture that works in both local development and cloud deployment environments:

- **Clean Architecture**: Uses Pydantic models and proper separation of concerns
- **Embedded Fallback**: Falls back to embedded configuration when imports fail
- **Persistent Storage**: Models are cached in persistent storage for faster restarts
- **GPU Optimization**: Automatic GPU detection and memory management
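
The embedded-fallback pattern works by attempting the normal package imports first and only switching to an inlined configuration when they fail. This is a minimal sketch; `app_config` and `MODEL_CONFIGS` are placeholder names, not the application's actual modules.

```python
# Hypothetical illustration of the import fallback.
try:
    from app_config import MODEL_CONFIGS  # clean-architecture config module
except ImportError:
    # Embedded fallback: keep a minimal configuration inline so the API can
    # still start when the package layout is not importable in the container.
    MODEL_CONFIGS = {
        "llama3.1-8b": {"repo_id": "LinguaCustodia/llama3.1-8b-fin-v0.3"},
    }
```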

## Development

For local development, see the main [README.md](README.md) file.

## License

MIT License - see LICENSE file for details.