# vLLM Integration Guide
## Overview

The LinguaCustodia Financial API now uses vLLM as its primary inference backend on both HuggingFace Spaces and Scaleway L40S instances. This delivers significant performance gains through optimized GPU memory management and faster inference.
## Architecture

### Backend Abstraction Layer

The application uses a platform-specific backend abstraction that automatically selects the optimal vLLM configuration based on the deployment environment (a minimal selection sketch follows the list below):
```python
class InferenceBackend:
    """Unified interface for all inference backends."""
```

- `VLLMBackend`: High-performance vLLM engine
- `TransformersBackend`: Fallback for compatibility
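A minimal sketch of how that selection might be wired up, assuming the two backend classes above and the `USE_VLLM` environment variable from the Configuration Reference; the `get_backend` helper and the stub bodies are illustrative, not the project's actual implementation:

```python
import os


class InferenceBackend:
    """Unified interface for all inference backends."""

    def generate(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError


class VLLMBackend(InferenceBackend):
    """High-performance vLLM engine (body omitted in this sketch)."""

    def generate(self, prompt: str, **kwargs) -> str:
        ...


class TransformersBackend(InferenceBackend):
    """Transformers fallback for compatibility (body omitted in this sketch)."""

    def generate(self, prompt: str, **kwargs) -> str:
        ...


def get_backend() -> InferenceBackend:
    """Select the backend from the environment; `get_backend` is an illustrative name."""
    if os.getenv("USE_VLLM", "true").lower() == "true":
        return VLLMBackend()
    return TransformersBackend()
```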
### Platform-Specific Configurations

#### HuggingFace Spaces (L40 GPU - 48GB VRAM)
```python
VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,     # Conservative (36GB of 48GB)
    "max_model_len": 2048,              # HF-optimized
    "enforce_eager": True,              # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,  # No custom kernels
    "dtype": "bfloat16",
}
```
Rationale: HuggingFace Spaces require conservative settings for stability and compatibility.
#### Scaleway L40S (48GB VRAM)
```python
VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,      # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,               # Full context length
    "enforce_eager": False,              # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
```
Rationale: Dedicated instances can use full optimizations for maximum performance.
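As a rough sketch of how these dictionaries map onto the engine: each key corresponds to a keyword argument of `vllm.LLM`. The `build_engine` helper and the model id below are illustrative assumptions, not the project's actual code:

```python
import os

from vllm import LLM, SamplingParams

# Platform-specific settings from the two blocks above.
VLLM_CONFIG_HF = {"gpu_memory_utilization": 0.75, "max_model_len": 2048,
                  "enforce_eager": True, "disable_custom_all_reduce": True,
                  "dtype": "bfloat16"}
VLLM_CONFIG_SCW = {"gpu_memory_utilization": 0.85, "max_model_len": 4096,
                   "enforce_eager": False, "disable_custom_all_reduce": False,
                   "dtype": "bfloat16"}


def build_engine(model_id: str) -> LLM:
    """Pick the config for the current platform and build the vLLM engine."""
    config = VLLM_CONFIG_SCW if os.getenv("DEPLOYMENT_ENV") == "scaleway" else VLLM_CONFIG_HF
    # Every key in the config dict is a vllm.LLM keyword argument.
    return LLM(model=model_id, **config)


if __name__ == "__main__":
    llm = build_engine("meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model id
    outputs = llm.generate(["What is EBITDA?"], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)
```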
## Deployment

### HuggingFace Spaces
Requirements:
- Dockerfile with `git` installed (for pip install from GitHub)
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=huggingface`, `USE_VLLM=true`
Current Status:
- ✅ Fully operational with vLLM
- ✅ L40 GPU (48GB VRAM)
- ✅ Eager mode for stability
- ✅ All endpoints functional
### Scaleway L40S
Requirements:
- NVIDIA CUDA base image (`nvidia/cuda:12.6.3-runtime-ubuntu22.04`)
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=scaleway`, `USE_VLLM=true`
Current Status:
- ✅ Ready for deployment
- ✅ Full CUDA graph optimizations
- ✅ Maximum performance configuration
## API Endpoints

### Standard Endpoints
- `POST /inference` - Standard inference with vLLM backend
- `GET /health` - Health check with backend information
- `GET /backend` - Backend configuration details
- `GET /models` - List available models
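A quick way to exercise these endpoints, as a sketch only: the base URL is a placeholder, and the `/inference` request body (`prompt`, `max_tokens`) is an assumed schema rather than a documented contract.

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder; use your Space or instance URL

# Health check: the response is expected to include backend information.
print(requests.get(f"{BASE_URL}/health", timeout=30).json())

# Standard inference: the payload keys below are assumptions about this API.
resp = requests.post(
    f"{BASE_URL}/inference",
    json={"prompt": "Summarize the key points of IFRS 9.", "max_tokens": 128},
    timeout=120,
)
print(resp.json())
```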
### OpenAI-Compatible Endpoints
- `POST /v1/chat/completions` - OpenAI chat completion format
- `POST /v1/completions` - OpenAI text completion format
- `GET /v1/models` - List models in OpenAI format
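Because these endpoints follow the OpenAI format, the standard `openai` Python client can be pointed at the API; the base URL, API key, and model name below are illustrative assumptions.

```python
from openai import OpenAI

# Point the stock OpenAI client at the vLLM-backed API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

print([m.id for m in client.models.list().data])  # GET /v1/models

chat = client.chat.completions.create(
    model="llama3.1-8b",  # assumed to match MODEL_NAME below
    messages=[{"role": "user", "content": "Explain duration risk in one sentence."}],
    max_tokens=64,
)
print(chat.choices[0].message.content)
```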
## Performance Metrics

### HuggingFace Spaces (L40 GPU)
- GPU Memory: 36GB utilized (75% of 48GB)
- KV Cache: 139,968 tokens
- Max Concurrency: 68.34x for 2,048 token requests
- Model Load Time: ~27 seconds
- Inference Speed: Fast even with eager mode (CUDA graphs disabled)
## Benefits Over Transformers Backend
- Memory Efficiency: 30-40% better GPU utilization
- Throughput: Higher concurrent request handling
- Batching: Continuous batching for multiple requests (see the sketch after this list)
- API Compatibility: OpenAI-compatible endpoints
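To illustrate the throughput and batching benefit, here is a hedged sketch that fires several requests concurrently at the OpenAI-compatible completions endpoint; the URL and model name are placeholders, and the response parsing assumes the standard OpenAI completions shape.

```python
import concurrent.futures

import requests

BASE_URL = "http://localhost:8000"  # placeholder URL
PROMPTS = ["Define the CET1 ratio.", "What is a credit default swap?", "Explain Value at Risk."]


def ask(prompt: str) -> str:
    # Standard OpenAI completions payload; vLLM batches concurrent requests server-side.
    r = requests.post(
        f"{BASE_URL}/v1/completions",
        json={"model": "llama3.1-8b", "prompt": prompt, "max_tokens": 64},
        timeout=120,
    )
    return r.json()["choices"][0]["text"]


# Fire the requests concurrently; continuous batching lets the engine
# process them together instead of strictly one at a time.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
    for answer in pool.map(ask, PROMPTS):
        print(answer.strip())
```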
## Troubleshooting

### Common Issues
1. Build Errors (HuggingFace)
   - Issue: Missing `git` in the Dockerfile
   - Solution: Add `git` to the `apt-get install` line in the Dockerfile

2. CUDA Compilation Errors
   - Issue: Attempting to build vLLM from source without a compiler
   - Solution: Use the official pre-compiled wheels (`vllm>=0.2.0`)

3. Memory Issues
   - Issue: OOM errors on model load
   - Solution: Reduce `gpu_memory_utilization` or `max_model_len`

4. ModelInfo Attribute Errors
   - Issue: Using `.get()` on ModelInfo objects
   - Solution: Use `getattr()` instead of `.get()` (see the sketch below)
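A short illustration of that last fix, assuming the ModelInfo objects come from `huggingface_hub` (the repo id is just an example):

```python
from huggingface_hub import HfApi

# ModelInfo is a plain Python object, not a dict, so .get() raises AttributeError.
info = HfApi().model_info("gpt2")  # illustrative repo id

# Wrong: info.get("pipeline_tag")  -> AttributeError
# Right: getattr() with a default instead.
pipeline_tag = getattr(info, "pipeline_tag", None)
print(pipeline_tag)
```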
## Configuration Reference

### Environment Variables
```bash
# Deployment configuration
DEPLOYMENT_ENV=huggingface   # or 'scaleway'
USE_VLLM=true

# Model selection
MODEL_NAME=llama3.1-8b       # Default model

# Storage
HF_HOME=/data/.huggingface

# Authentication
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_token
```
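For completeness, a sketch of how the application might read these variables at startup; the variable names match the reference above, while the defaults are assumptions:

```python
import os

DEPLOYMENT_ENV = os.getenv("DEPLOYMENT_ENV", "huggingface")  # or "scaleway"
USE_VLLM = os.getenv("USE_VLLM", "true").lower() == "true"
MODEL_NAME = os.getenv("MODEL_NAME", "llama3.1-8b")

# Cache model weights on persistent storage.
os.environ.setdefault("HF_HOME", "/data/.huggingface")

print(DEPLOYMENT_ENV, USE_VLLM, MODEL_NAME)
```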
### Requirements Files

- `requirements.txt` - HuggingFace (default, with official vLLM)
- `requirements-hf.txt` - HuggingFace-specific
- `requirements-scaleway.txt` - Scaleway-specific
## Future Enhancements
- Implement streaming responses
- Add request queueing and rate limiting
- Optimize KV cache settings per model
- Add metrics and monitoring endpoints
- Support for multi-GPU setups
Last Updated: October 4, 2025
Version: 24.1.0
Status: Production Ready ✅