# Scaleway L40S GPU Deployment Guide

## Overview

This guide covers deploying LinguaCustodia Financial AI on Scaleway's L40S GPU instances for high-performance inference.

## Instance Configuration

**Hardware:**
- **GPU**: NVIDIA L40S (48GB VRAM)
- **Region**: Paris 2 (fr-par-2)
- **Instance Type**: L40S-1-48G
- **RAM**: 48GB
- **vCPUs**: Dedicated

**Software:**
- **OS**: Ubuntu 24.04 LTS (Scaleway GPU OS 12 Passthrough)
- **NVIDIA Drivers**: Pre-installed
- **Docker**: 28.3.2 with NVIDIA Docker 2.13.0
- **CUDA**: 12.6.3 (runtime via Docker)

## Deployment Architecture

### Docker-Based Deployment

We use a containerized approach with NVIDIA CUDA base images and CUDA graphs optimization:

```
┌───────────────────────────────────────┐
│ Scaleway L40S Instance (Bare Metal)   │
│                                       │
│  ┌─────────────────────────────────┐  │
│  │ Docker Container                │  │
│  │ ├─ CUDA 12.6.3 Runtime          │  │
│  │ ├─ Python 3.11                  │  │
│  │ ├─ PyTorch 2.8.0                │  │
│  │ ├─ Transformers 4.57.0          │  │
│  │ └─ LinguaCustodia API (app.py)  │  │
│  └─────────────────────────────────┘  │
│             ↕ --gpus all              │
│  ┌─────────────────────────────────┐  │
│  │ NVIDIA L40S GPU (48GB)          │  │
│  └─────────────────────────────────┘  │
└───────────────────────────────────────┘
```

## Prerequisites

1. **Scaleway Account** with billing enabled
2. **SSH Key** configured in the Scaleway console
3. **Local Environment**:
   - Docker installed (for building images locally)
   - SSH access configured
   - Git configured for dual remotes (GitHub + HuggingFace)

## Deployment Steps

### 1. Create L40S Instance

```bash
# Via the Scaleway Console or CLI.
# Note: use Scaleway's GPU OS 12 image (Ubuntu 24.04) rather than plain
# ubuntu_focal (20.04); check the console for the exact image name.
scw instance server create \
  type=L40S-1-48G \
  zone=fr-par-2 \
  image=<GPU_OS_12_IMAGE> \
  name=linguacustodia-finance
```

### 2. SSH Setup

```bash
# Add your SSH key to Scaleway, then connect
ssh root@<INSTANCE_IP>
```

### 3. Upload Files

```bash
# From your local machine
cd /Users/jeanbapt/LLM-Pro-Fin-Inference
scp Dockerfile.scaleway app.py requirements.txt root@<INSTANCE_IP>:/root/
```

### 4. Build Docker Image

```bash
# On the L40S instance
cd /root
docker build -f Dockerfile.scaleway -t linguacustodia-api:scaleway .
```

**Build time**: ~2-3 minutes (depends on network speed for downloading dependencies)

### 5. Run Container

```bash
docker run -d \
  --name linguacustodia-api \
  --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN=<YOUR_HF_TOKEN> \
  -e HF_TOKEN_LC=<YOUR_LC_TOKEN> \
  -e MODEL_NAME=qwen3-8b \
  -e APP_PORT=7860 \
  -e LOG_LEVEL=INFO \
  -e HF_HOME=/data/.huggingface \
  -v /root/.cache/huggingface:/data/.huggingface \
  --restart unless-stopped \
  linguacustodia-api:scaleway
```

**Important Environment Variables:**
- `HF_TOKEN`: HuggingFace access token
- `HF_TOKEN_LC`: LinguaCustodia model access token
- `MODEL_NAME`: Default model to load (`qwen3-8b`, `gemma3-12b`, `llama3.1-8b`, etc.)
- `HF_HOME`: Model cache directory (persistent across container restarts)

### 6. Verify Deployment

```bash
# Check container status
docker ps

# Check logs
docker logs -f linguacustodia-api

# Test health endpoint
curl http://localhost:7860/health

# Test inference
curl -X POST http://localhost:7860/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is EBITDA?", "max_new_tokens": 100}'
```
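On a cold start the first health check can fail for several minutes while the model downloads (see the caching notes below), so it helps to wrap the checks above in a small script that waits for the API before exercising it. This is a minimal sketch, assuming `/health` returns HTTP 200 only once the model is ready; adjust it to whatever your `app.py` actually reports.

```bash
#!/usr/bin/env bash
# smoke_test.sh - wait for the API to become healthy, then run one test inference.
# Assumption: /health returns HTTP 200 only when the model is loaded.
set -euo pipefail

API="http://localhost:7860"

echo "Waiting for ${API}/health ..."
for i in $(seq 1 60); do
  if curl -sf "${API}/health" > /dev/null; then
    echo "API is healthy."
    break
  fi
  sleep 10   # cold starts can take 5-10 minutes while the model downloads
done

# Single test inference to confirm the model answers financial questions.
curl -sf -X POST "${API}/inference" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is EBITDA?", "max_new_tokens": 100}'
echo
```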
## Model Caching Strategy

### First Run (Cold Start)
- Model downloaded from HuggingFace (~16GB for qwen3-8b)
- Cached to `/data/.huggingface` (mapped to `/root/.cache/huggingface` on the host)
- Load time: ~5-10 minutes

### Subsequent Runs (Warm Start)
- Model loaded from the local cache
- Load time: ~30 seconds

### Model Switching

When switching models via the `/load-model` endpoint:
1. GPU memory is cleared
2. The new model is loaded from cache (if available) or downloaded
3. The previous model's cache is preserved on disk

## Available Models

| Model ID | Display Name | Parameters | VRAM | Recommended Instance |
|----------|--------------|------------|------|----------------------|
| `qwen3-8b` | Qwen 3 8B Financial | 8B | 8GB | L40S (default) |
| `llama3.1-8b` | Llama 3.1 8B Financial | 8B | 8GB | L40S |
| `gemma3-12b` | Gemma 3 12B Financial | 12B | 12GB | L40S |
| `llama3.1-70b` | Llama 3.1 70B Financial | 70B | 40GB | L40S |
| `fin-pythia-1.4b` | FinPythia 1.4B | 1.4B | 2GB | Any |

## API Endpoints

```bash
# Root Info
GET http://<INSTANCE_IP>:7860/

# Health Check
GET http://<INSTANCE_IP>:7860/health

# Inference
POST http://<INSTANCE_IP>:7860/inference
{
  "prompt": "Your question here",
  "max_new_tokens": 200,
  "temperature": 0.7
}

# Switch Model
POST http://<INSTANCE_IP>:7860/load-model
{
  "model_name": "gemma3-12b"
}

# List Available Models
GET http://<INSTANCE_IP>:7860/models
```

## CUDA Graphs Optimization

### What are CUDA Graphs?

CUDA graphs eliminate kernel launch overhead by pre-compiling GPU operations into reusable graphs. This provides significant performance improvements for inference workloads.

### Configuration

The Scaleway deployment automatically enables CUDA graphs with these optimizations:

- **`enforce_eager=False`**: Enables CUDA graphs (disabled on HuggingFace for stability)
- **`disable_custom_all_reduce=False`**: Enables custom kernels for better performance
- **`gpu_memory_utilization=0.85`**: Aggressive memory usage (87% actual utilization)
- **Graph Capture**: 67 mixed prefill-decode graphs + 35 decode graphs

### Performance Impact

- **20-30% faster inference** compared to eager mode
- **Reduced latency** for repeated operations
- **Better GPU utilization** (87% vs 75% on HuggingFace)
- **Higher concurrency** (37.36x max concurrent requests)

### Verification

Confirm that CUDA graphs are active by looking for these log messages:

```
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35
Graph capturing finished in 6 secs, took 0.85 GiB
```

## Performance Metrics

### Qwen 3 8B on L40S (with CUDA Graphs)
- **Load Time** (cold): ~5-10 minutes
- **Load Time** (warm): ~30 seconds
- **Inference Speed**: ~80-120 tokens/second (20-30% improvement with CUDA graphs)
- **Memory Usage**: ~15GB VRAM (87% utilization), ~4GB RAM
- **Concurrent Requests**: Up to 37.36x (4K token requests)
- **CUDA Graphs**: 67 mixed prefill-decode + 35 decode graphs captured
- **Response Times**: ~0.37s simple queries, ~3.5s complex financial analysis

## Cost Optimization

### Development/Testing

```bash
# Stop the container when not in use
docker stop linguacustodia-api

# Stop the instance via the Scaleway console
# Compute billing stops when the instance is powered off
```

### Production

- Use `--restart unless-stopped` for automatic recovery
- Monitor with `docker stats linguacustodia-api`
- Set up CloudWatch/Datadog for alerting

## Troubleshooting

### Container Fails to Start

**Symptom**: Container exits immediately

**Solution**:
```bash
# Check logs
docker logs linguacustodia-api

# Common issues:
# 1. Invalid HuggingFace tokens
# 2. Insufficient disk space
# 3. GPU not accessible
```

### "Invalid user token" Error

**Symptom**: `ERROR:app:❌ Failed to load model: Invalid user token.`

**Solution**:
```bash
# Ensure tokens don't have quotes
# Recreate the container with correct env vars
docker rm linguacustodia-api
docker run -d --name linguacustodia-api --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN=<YOUR_HF_TOKEN> \
  -e HF_TOKEN_LC=<YOUR_LC_TOKEN> \
  ...
```
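If you suspect stray quotes, you can confirm it from the host without printing the secrets. This is a minimal sketch; it assumes the container is still running and only relies on the `HF_TOKEN`/`HF_TOKEN_LC` variable names used in step 5.

```bash
# Run on the L40S host. Checks each token for stray quote characters without
# echoing the secret itself; variable names match the docker run command above.
for var in HF_TOKEN HF_TOKEN_LC; do
  if docker exec linguacustodia-api printenv "$var" | grep -q "[\"']"; then
    echo "$var contains a quote character - recreate the container without quotes"
  else
    echo "$var looks clean"
  fi
done
```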
### GPU Not Detected

**Symptom**: Model loads on CPU

**Solution**:
```bash
# Verify GPU access
docker exec linguacustodia-api nvidia-smi

# Ensure the --gpus all flag is set
docker inspect linguacustodia-api | grep -i gpu
```

### Out of Memory

**Symptom**: `torch.cuda.OutOfMemoryError`

**Solution**:
1. Switch to a smaller model (`qwen3-8b` or `fin-pythia-1.4b`)
2. Clear the GPU cache:
```bash
docker restart linguacustodia-api
```

## Maintenance

### Update Application

```bash
# Upload new app.py
scp app.py root@<INSTANCE_IP>:/root/

# Rebuild and restart
ssh root@<INSTANCE_IP>
docker build -f Dockerfile.scaleway -t linguacustodia-api:scaleway .
docker stop linguacustodia-api
docker rm linguacustodia-api
# Run the command from step 5
```

### Update CUDA Version

Edit `Dockerfile.scaleway`:

```dockerfile
# Update the version tag to the desired CUDA runtime release
FROM nvidia/cuda:<NEW_CUDA_VERSION>-runtime-ubuntu22.04
```

Then rebuild.

### Backup Model Cache

```bash
# On the L40S instance
tar -czf models-backup.tar.gz /root/.cache/huggingface/
scp models-backup.tar.gz user@backup-server:/backups/
```

## Security

### Network Security
- **Firewall**: Restrict port 7860 to trusted IPs
- **SSH**: Use key-based authentication only
- **Updates**: Regularly update Ubuntu and Docker

### API Security
- **Authentication**: Implement API keys (not included in the current version)
- **Rate Limiting**: Use nginx/Caddy as a reverse proxy
- **HTTPS**: Set up Let's Encrypt certificates

### Token Management
- Store tokens in a `.env` file (never commit it to git)
- Use Scaleway Secret Manager for production
- Rotate tokens regularly

## Monitoring

### Resource Usage

```bash
# GPU utilization
nvidia-smi -l 1

# Container stats
docker stats linguacustodia-api

# Disk usage
df -h /root/.cache/huggingface
```

### Application Logs

```bash
# Real-time logs
docker logs -f linguacustodia-api

# Last 100 lines
docker logs --tail 100 linguacustodia-api

# Filter for errors
docker logs linguacustodia-api 2>&1 | grep ERROR
```

## Comparison: Scaleway vs HuggingFace Spaces

| Feature | Scaleway L40S | HuggingFace Spaces |
|---------|---------------|--------------------|
| **GPU** | L40S (48GB) | A10G (24GB) |
| **Control** | Full root access | Limited |
| **Cost** | Pay per hour | Free tier + paid |
| **Uptime** | 100% (if running) | Variable |
| **Setup** | Manual | Automated |
| **Scaling** | Manual | Automatic |
| **Best For** | Production, large models | Prototyping, demos |

## Cost Estimate

**Scaleway L40S Pricing** (as of 2025):
- **Per Hour**: ~$1.50-2.00
- **Per Month** (24/7): ~$1,100-1,450
- **Recommended**: Use on demand and power off when not in use

**Example Usage**:
- 8 hours/day, 20 days/month: ~$240-320/month
- Development/testing only: ~$50-100/month

## Next Steps

1. **Set up monitoring**: Integrate with your monitoring stack
2. **Implement CI/CD**: Automate deployments with GitHub Actions
3. **Add authentication**: Secure the API with JWT tokens
4. **Scale horizontally**: Deploy multiple instances behind a load balancer
5. **Optimize costs**: Use spot instances or reserved capacity

## Support

- **Scaleway Documentation**: https://www.scaleway.com/en/docs/compute/gpu/
- **LinguaCustodia Issues**: https://github.com/DealExMachina/llm-pro-fin-api/issues
- **NVIDIA Docker**: https://github.com/NVIDIA/nvidia-docker

---

**Last Updated**: October 3, 2025
**Deployment Status**: ✅ Production-ready
**Instance**: `51.159.152.233` (Paris 2)