# vLLM Integration Guide

## Overview

The LinguaCustodia Financial API now uses vLLM as its primary inference backend on both HuggingFace Spaces and Scaleway L40S instances. Compared with the Transformers backend, vLLM delivers better GPU memory utilization and higher inference throughput.

## Architecture

### Backend Abstraction Layer

The application uses a platform-specific backend abstraction that automatically selects the optimal vLLM configuration based on the deployment environment:

```python
class InferenceBackend:
    """Unified interface for all inference backends.

    Implementations:
    - VLLMBackend: high-performance vLLM engine
    - TransformersBackend: fallback for compatibility
    """
```
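
A minimal sketch of what those two backends could look like behind a shared interface; the class and method names here are illustrative assumptions, not the project's exact code:

```python
from transformers import pipeline
from vllm import LLM, SamplingParams


class VLLMBackend:
    """High-performance backend; `config` is one of the platform dicts below."""

    def __init__(self, model_id: str, config: dict):
        self.llm = LLM(model=model_id, **config)

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        params = SamplingParams(max_tokens=max_tokens)
        return self.llm.generate([prompt], params)[0].outputs[0].text


class TransformersBackend:
    """Compatibility fallback built on a transformers text-generation pipeline."""

    def __init__(self, model_id: str):
        self.pipe = pipeline("text-generation", model=model_id)

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return self.pipe(prompt, max_new_tokens=max_tokens)[0]["generated_text"]
```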

### Platform-Specific Configurations

#### HuggingFace Spaces (L40 GPU - 48GB VRAM)
```python
VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,  # Conservative (36GB of 48GB)
    "max_model_len": 2048,           # HF-optimized
    "enforce_eager": True,           # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,  # No custom kernels
    "dtype": "bfloat16",
}
```

**Rationale**: HuggingFace Spaces require conservative settings for stability and compatibility.

#### Scaleway L40S (48GB VRAM)
```python
VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,  # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,           # Full context length
    "enforce_eager": False,          # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
```

**Rationale**: Dedicated instances can use full optimizations for maximum performance.
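
A minimal sketch of how the two configurations could be selected at startup, assuming `DEPLOYMENT_ENV` is read from the environment (the helper name is hypothetical):

```python
import os


def select_vllm_config() -> dict:
    """Pick the platform-specific vLLM config based on DEPLOYMENT_ENV."""
    env = os.getenv("DEPLOYMENT_ENV", "huggingface").lower()
    return VLLM_CONFIG_SCW if env == "scaleway" else VLLM_CONFIG_HF


# Usage (the model id comes from MODEL_NAME; see the Configuration Reference below):
# backend = VLLMBackend(os.getenv("MODEL_NAME", "llama3.1-8b"), select_vllm_config())
```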

## Deployment

### HuggingFace Spaces

**Requirements:**
- Dockerfile with `git` installed (for pip install from GitHub)
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=huggingface`, `USE_VLLM=true`

**Current Status:**
- ✅ Fully operational with vLLM
- ✅ L40 GPU (48GB VRAM)
- ✅ Eager mode for stability
- ✅ All endpoints functional

### Scaleway L40S

**Requirements:**
- NVIDIA CUDA base image (`nvidia/cuda:12.6.3-runtime-ubuntu22.04`)
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=scaleway`, `USE_VLLM=true`

**Current Status:**
- ✅ Ready for deployment
- ✅ Full CUDA graph optimizations
- ✅ Maximum performance configuration

## API Endpoints

### Standard Endpoints
- `POST /inference` - Standard inference with vLLM backend
- `GET /health` - Health check with backend information
- `GET /backend` - Backend configuration details
- `GET /models` - List available models

### OpenAI-Compatible Endpoints
- `POST /v1/chat/completions` - OpenAI chat completion format (see the example below)
- `POST /v1/completions` - OpenAI text completion format
- `GET /v1/models` - List models in OpenAI format
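
As an example, the chat completion endpoint can be exercised with a plain HTTP client; the base URL and model name below are placeholders:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder base URL
    json={
        "model": "llama3.1-8b",  # placeholder; use a model listed by GET /v1/models
        "messages": [{"role": "user", "content": "Summarize IFRS 17 in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```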

## Performance Metrics

### HuggingFace Spaces (L40 GPU)
- **GPU Memory**: 36GB utilized (75% of 48GB)
- **KV Cache**: 139,968 tokens
- **Max Concurrency**: 68.34x for 2,048 token requests
- **Model Load Time**: ~27 seconds
- **Inference Speed**: Fast even with eager mode (CUDA graphs disabled)

### Benefits Over Transformers Backend
- **Memory Efficiency**: 30-40% better GPU utilization
- **Throughput**: Higher concurrent request handling
- **Batching**: Continuous batching for multiple requests (see the sketch below)
- **API Compatibility**: OpenAI-compatible endpoints
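
To illustrate the batching point, vLLM accepts many prompts in a single call and schedules them with continuous batching. A sketch using the backend object from earlier; the prompts and sampling values are arbitrary:

```python
from vllm import SamplingParams

# `backend.llm` is the vLLM LLM instance from the VLLMBackend sketch above.
params = SamplingParams(temperature=0.2, max_tokens=64)
outputs = backend.llm.generate(
    ["What is Basel III?", "Define free cash flow.", "Explain duration risk."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```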

## Troubleshooting

### Common Issues

**1. Build Errors (HuggingFace)**
- **Issue**: Missing `git` in Dockerfile
- **Solution**: Add `git` to apt-get install in Dockerfile

**2. CUDA Compilation Errors**
- **Issue**: Attempting to build from source without compiler
- **Solution**: Use official pre-compiled wheels (`vllm>=0.2.0`)

**3. Memory Issues**
- **Issue**: OOM errors on model load
- **Solution**: Reduce `gpu_memory_utilization` or `max_model_len`

**4. ModelInfo Attribute Errors**
- **Issue**: Using `.get()` on ModelInfo objects
- **Solution**: Use `getattr()` instead of `.get()` (see the sketch below)
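
For the last issue, a minimal illustration: `ModelInfo` objects from `huggingface_hub` expose attributes rather than dict methods, so `getattr()` with a default is the safe access pattern (the repo id here is just an example):

```python
from huggingface_hub import HfApi

info = HfApi().model_info("gpt2")

# info.get("pipeline_tag")  # AttributeError: ModelInfo has no .get()
pipeline_tag = getattr(info, "pipeline_tag", None)  # safe attribute access
```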

## Configuration Reference

### Environment Variables
```bash
# Deployment configuration
DEPLOYMENT_ENV=huggingface  # or 'scaleway'
USE_VLLM=true

# Model selection
MODEL_NAME=llama3.1-8b  # Default model

# Storage
HF_HOME=/data/.huggingface

# Authentication
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_token
```
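
A hedged sketch of how the authentication and storage variables might be consumed when fetching model weights; the repository id and the token precedence shown here are assumptions:

```python
import os

from huggingface_hub import snapshot_download

# Prefer the LinguaCustodia token for gated repos, fall back to the generic token.
token = os.getenv("HF_TOKEN_LC") or os.getenv("HF_TOKEN")

# With HF_HOME=/data/.huggingface the hub cache (and this download) lives on
# persistent storage; "org/model-id" is a placeholder repository id.
weights_path = snapshot_download(repo_id="org/model-id", token=token)
```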

### Requirements Files
- `requirements.txt` - HuggingFace (default with official vLLM)
- `requirements-hf.txt` - HuggingFace-specific
- `requirements-scaleway.txt` - Scaleway-specific

## Future Enhancements

- [ ] Implement streaming responses
- [ ] Add request queueing and rate limiting
- [ ] Optimize KV cache settings per model
- [ ] Add metrics and monitoring endpoints
- [ ] Support for multi-GPU setups

## References

- [vLLM Official Documentation](https://docs.vllm.ai/)
- [HuggingFace Spaces Documentation](https://huggingface.co/docs/hub/spaces)
- [LinguaCustodia Models](https://huggingface.co/LinguaCustodia)

---

**Last Updated**: October 4, 2025
**Version**: 24.1.0
**Status**: Production Ready ✅