# Scaleway L40S GPU Deployment Guide

## Overview

This guide covers deploying LinguaCustodia Financial AI on Scaleway's L40S GPU instances for high-performance inference.

## Instance Configuration

**Hardware:**
- **GPU**: NVIDIA L40S (48GB VRAM)
- **Region**: Paris 2 (fr-par-2)
- **Instance Type**: L40S-1-48G
- **RAM**: 48GB
- **vCPUs**: Dedicated

**Software:**
- **OS**: Ubuntu 24.04 LTS (Scaleway GPU OS 12 Passthrough)
- **NVIDIA Drivers**: Pre-installed
- **Docker**: 28.3.2 with NVIDIA Docker 2.13.0
- **CUDA**: 12.6.3 (runtime via Docker)

## Deployment Architecture

### Docker-Based Deployment

We use a containerized approach with NVIDIA CUDA base images and CUDA graphs optimization:

```
┌───────────────────────────────────────┐
│ Scaleway L40S Instance (Bare Metal)   │
│                                       │
│  ┌─────────────────────────────────┐  │
│  │ Docker Container                │  │
│  │ ├─ CUDA 12.6.3 Runtime          │  │
│  │ ├─ Python 3.11                  │  │
│  │ ├─ PyTorch 2.8.0                │  │
│  │ ├─ Transformers 4.57.0          │  │
│  │ └─ LinguaCustodia API (app.py)  │  │
│  └─────────────────────────────────┘  │
│             ↕ --gpus all              │
│  ┌─────────────────────────────────┐  │
│  │ NVIDIA L40S GPU (48GB)          │  │
│  └─────────────────────────────────┘  │
└───────────────────────────────────────┘
```

## Prerequisites

1. **Scaleway Account** with billing enabled
2. **SSH Key** configured in the Scaleway console
3. **Local Environment**:
   - Docker installed (for building images locally)
   - SSH access configured
   - Git configured for dual remotes (GitHub + HuggingFace)

## Deployment Steps

### 1. Create L40S Instance

```bash
# Via the Scaleway Console or CLI.
# Note: use Scaleway's GPU OS 12 image (Ubuntu 24.04) rather than plain
# ubuntu_focal (20.04); check the console for the exact image name.
scw instance server create \
  type=L40S-1-48G \
  zone=fr-par-2 \
  image=<GPU_OS_12_IMAGE> \
  name=linguacustodia-finance
```

### 2. SSH Setup

```bash
# Add your SSH key to Scaleway, then connect
ssh root@<INSTANCE_IP>
```

### 3. Upload Files

```bash
# From your local machine
cd /Users/jeanbapt/LLM-Pro-Fin-Inference
scp Dockerfile.scaleway app.py requirements.txt root@<INSTANCE_IP>:/root/
```

### 4. Build Docker Image

```bash
# On the L40S instance
cd /root
docker build -f Dockerfile.scaleway -t linguacustodia-api:scaleway .
```

**Build time**: ~2-3 minutes (depends on network speed for downloading dependencies)

### 5. Run Container

```bash
docker run -d \
  --name linguacustodia-api \
  --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN=<YOUR_HF_TOKEN> \
  -e HF_TOKEN_LC=<YOUR_LC_TOKEN> \
  -e MODEL_NAME=qwen3-8b \
  -e APP_PORT=7860 \
  -e LOG_LEVEL=INFO \
  -e HF_HOME=/data/.huggingface \
  -v /root/.cache/huggingface:/data/.huggingface \
  --restart unless-stopped \
  linguacustodia-api:scaleway
```

**Important Environment Variables:**
- `HF_TOKEN`: HuggingFace access token
- `HF_TOKEN_LC`: LinguaCustodia model access token
- `MODEL_NAME`: Default model to load (`qwen3-8b`, `gemma3-12b`, `llama3.1-8b`, etc.)
- `HF_HOME`: Model cache directory (persistent across container restarts)

### 6. Verify Deployment

```bash
# Check container status
docker ps

# Check logs
docker logs -f linguacustodia-api

# Test health endpoint
curl http://localhost:7860/health

# Test inference
curl -X POST http://localhost:7860/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is EBITDA?", "max_new_tokens": 100}'
```
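On a cold start the first health check can fail for several minutes while the model downloads (see the caching notes below), so it helps to wrap the checks above in a small script that waits for the API before exercising it. This is a minimal sketch, assuming `/health` returns HTTP 200 only once the model is ready; adjust it to whatever your `app.py` actually reports.

```bash
#!/usr/bin/env bash
# smoke_test.sh - wait for the API to become healthy, then run one test inference.
# Assumption: /health returns HTTP 200 only when the model is loaded.
set -euo pipefail

API="http://localhost:7860"

echo "Waiting for ${API}/health ..."
for i in $(seq 1 60); do
  if curl -sf "${API}/health" > /dev/null; then
    echo "API is healthy."
    break
  fi
  sleep 10   # cold starts can take 5-10 minutes while the model downloads
done

# Single test inference to confirm the model answers financial questions.
curl -sf -X POST "${API}/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is EBITDA?", "max_new_tokens": 100}'
echo
```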
## Model Caching Strategy

### First Run (Cold Start)
- Model downloaded from HuggingFace (~16GB for qwen3-8b)
- Cached to `/data/.huggingface` (mapped to `/root/.cache/huggingface` on the host)
- Load time: ~5-10 minutes

### Subsequent Runs (Warm Start)
- Model loaded from the local cache
- Load time: ~30 seconds

### Model Switching

When switching models via the `/load-model` endpoint:
1. GPU memory is cleared
2. The new model is loaded from cache (if available) or downloaded
3. The previous model's cache is preserved on disk

## Available Models

| Model ID | Display Name | Parameters | VRAM | Recommended Instance |
|----------|--------------|------------|------|----------------------|
| `qwen3-8b` | Qwen 3 8B Financial | 8B | 8GB | L40S (default) |
| `llama3.1-8b` | Llama 3.1 8B Financial | 8B | 8GB | L40S |
| `gemma3-12b` | Gemma 3 12B Financial | 12B | 12GB | L40S |
| `llama3.1-70b` | Llama 3.1 70B Financial | 70B | 40GB | L40S |
| `fin-pythia-1.4b` | FinPythia 1.4B | 1.4B | 2GB | Any |

## API Endpoints

```bash
# Root Info
GET http://<INSTANCE_IP>:7860/

# Health Check
GET http://<INSTANCE_IP>:7860/health

# Inference
POST http://<INSTANCE_IP>:7860/inference
{
  "prompt": "Your question here",
  "max_new_tokens": 200,
  "temperature": 0.7
}

# Switch Model
POST http://<INSTANCE_IP>:7860/load-model
{
  "model_name": "gemma3-12b"
}

# List Available Models
GET http://<INSTANCE_IP>:7860/models
```

## CUDA Graphs Optimization

### What are CUDA Graphs?

CUDA graphs eliminate kernel launch overhead by pre-compiling GPU operations into reusable graphs. This provides significant performance improvements for inference workloads.

### Configuration

The Scaleway deployment automatically enables CUDA graphs with these optimizations:

- **`enforce_eager=False`**: Enables CUDA graphs (disabled on HuggingFace for stability)
- **`disable_custom_all_reduce=False`**: Enables custom kernels for better performance
- **`gpu_memory_utilization=0.85`**: Aggressive memory usage (87% actual utilization)
- **Graph Capture**: 67 mixed prefill-decode graphs + 35 decode graphs

### Performance Impact

- **20-30% faster inference** compared to eager mode
- **Reduced latency** for repeated operations
- **Better GPU utilization** (87% vs 75% on HuggingFace)
- **Higher concurrency** (37.36x max concurrent requests)

### Verification

Confirm that CUDA graphs are active by looking for these log messages:

```
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35
Graph capturing finished in 6 secs, took 0.85 GiB
```

## Performance Metrics

### Qwen 3 8B on L40S (with CUDA Graphs)
- **Load Time** (cold): ~5-10 minutes
- **Load Time** (warm): ~30 seconds
- **Inference Speed**: ~80-120 tokens/second (20-30% improvement with CUDA graphs)
- **Memory Usage**: ~15GB VRAM (87% utilization), ~4GB RAM
- **Concurrent Requests**: Up to 37.36x (4K token requests)
- **CUDA Graphs**: 67 mixed prefill-decode + 35 decode graphs captured
- **Response Times**: ~0.37s simple queries, ~3.5s complex financial analysis

## Cost Optimization

### Development/Testing

```bash
# Stop the container when not in use
docker stop linguacustodia-api

# Stop the instance via the Scaleway console
# Compute billing stops when the instance is powered off
```

### Production

- Use `--restart unless-stopped` for automatic recovery
- Monitor with `docker stats linguacustodia-api`
- Set up CloudWatch/Datadog for alerting

## Troubleshooting

### Container Fails to Start

**Symptom**: Container exits immediately

**Solution**:
```bash
# Check logs
docker logs linguacustodia-api

# Common issues:
# 1. Invalid HuggingFace tokens
# 2. Insufficient disk space
# 3. GPU not accessible
```

### "Invalid user token" Error

**Symptom**: `ERROR:app:❌ Failed to load model: Invalid user token.`

**Solution**:
```bash
# Ensure tokens don't have quotes
# Recreate the container with correct env vars
docker rm linguacustodia-api
docker run -d --name linguacustodia-api --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN=<YOUR_HF_TOKEN> \
  -e HF_TOKEN_LC=<YOUR_LC_TOKEN> \
  ...
```
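If you suspect stray quotes, you can confirm it from the host without printing the secrets. This is a minimal sketch; it assumes the container is still running and only relies on the `HF_TOKEN`/`HF_TOKEN_LC` variable names used in step 5.

```bash
# Run on the L40S host. Checks each token for stray quote characters without
# echoing the secret itself; variable names match the docker run command above.
for var in HF_TOKEN HF_TOKEN_LC; do
  if docker exec linguacustodia-api printenv "$var" | grep -q "[\"']"; then
    echo "$var contains a quote character - recreate the container without quotes"
  else
    echo "$var looks clean"
  fi
done
```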
### GPU Not Detected

**Symptom**: Model loads on CPU

**Solution**:
```bash
# Verify GPU access
docker exec linguacustodia-api nvidia-smi

# Ensure the --gpus all flag is set
docker inspect linguacustodia-api | grep -i gpu
```

### Out of Memory

**Symptom**: `torch.cuda.OutOfMemoryError`

**Solution**:
1. Switch to a smaller model (`qwen3-8b` or `fin-pythia-1.4b`)
2. Clear the GPU cache:
```bash
docker restart linguacustodia-api
```

## Maintenance

### Update Application

```bash
# Upload new app.py
scp app.py root@<INSTANCE_IP>:/root/

# Rebuild and restart
ssh root@<INSTANCE_IP>
docker build -f Dockerfile.scaleway -t linguacustodia-api:scaleway .
docker stop linguacustodia-api
docker rm linguacustodia-api
# Run the command from step 5
```

### Update CUDA Version

Edit `Dockerfile.scaleway`:

```dockerfile
# Update the version tag to the desired CUDA runtime release
FROM nvidia/cuda:<NEW_CUDA_VERSION>-runtime-ubuntu22.04
```

Then rebuild.

### Backup Model Cache

```bash
# On the L40S instance
tar -czf models-backup.tar.gz /root/.cache/huggingface/
scp models-backup.tar.gz user@backup-server:/backups/
```

## Security

### Network Security
- **Firewall**: Restrict port 7860 to trusted IPs
- **SSH**: Use key-based authentication only
- **Updates**: Regularly update Ubuntu and Docker

### API Security
- **Authentication**: Implement API keys (not included in the current version)
- **Rate Limiting**: Use nginx/Caddy as a reverse proxy
- **HTTPS**: Set up Let's Encrypt certificates

### Token Management
- Store tokens in a `.env` file (never commit it to git)
- Use Scaleway Secret Manager for production
- Rotate tokens regularly

## Monitoring

### Resource Usage

```bash
# GPU utilization
nvidia-smi -l 1

# Container stats
docker stats linguacustodia-api

# Disk usage
df -h /root/.cache/huggingface
```

### Application Logs

```bash
# Real-time logs
docker logs -f linguacustodia-api

# Last 100 lines
docker logs --tail 100 linguacustodia-api

# Filter for errors
docker logs linguacustodia-api 2>&1 | grep ERROR
```

## Comparison: Scaleway vs HuggingFace Spaces

| Feature | Scaleway L40S | HuggingFace Spaces |
|---------|---------------|--------------------|
| **GPU** | L40S (48GB) | A10G (24GB) |
| **Control** | Full root access | Limited |
| **Cost** | Pay per hour | Free tier + paid |
| **Uptime** | 100% (if running) | Variable |
| **Setup** | Manual | Automated |
| **Scaling** | Manual | Automatic |
| **Best For** | Production, large models | Prototyping, demos |

## Cost Estimate

**Scaleway L40S Pricing** (as of 2025):
- **Per Hour**: ~$1.50-2.00
- **Per Month** (24/7): ~$1,100-1,450
- **Recommended**: Use on demand and power off when not in use

**Example Usage**:
- 8 hours/day, 20 days/month: ~$240-320/month
- Development/testing only: ~$50-100/month

## Next Steps

1. **Set up monitoring**: Integrate with your monitoring stack
2. **Implement CI/CD**: Automate deployments with GitHub Actions
3. **Add authentication**: Secure the API with JWT tokens
4. **Scale horizontally**: Deploy multiple instances behind a load balancer
5. **Optimize costs**: Use spot instances or reserved capacity

## Support

- **Scaleway Documentation**: https://www.scaleway.com/en/docs/compute/gpu/
- **LinguaCustodia Issues**: https://github.com/DealExMachina/llm-pro-fin-api/issues
- **NVIDIA Docker**: https://github.com/NVIDIA/nvidia-docker

---

**Last Updated**: October 3, 2025
**Deployment Status**: ✅ Production-ready
**Instance**: `51.159.152.233` (Paris 2)