# vLLM Integration Guide
## Overview

The LinguaCustodia Financial API now uses vLLM as its primary inference backend on both HuggingFace Spaces and Scaleway L40S instances. This delivers significant performance gains through optimized GPU memory management and faster inference.
## Architecture

### Backend Abstraction Layer

The application uses a platform-specific backend abstraction that automatically selects the optimal vLLM configuration based on the deployment environment (a minimal selection sketch follows the list below):
```python
class InferenceBackend:
    """Unified interface for all inference backends."""
```

- `VLLMBackend`: High-performance vLLM engine
- `TransformersBackend`: Fallback for compatibility
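A minimal sketch of how that selection might be wired up, assuming the two backend classes above and the `USE_VLLM` environment variable from the Configuration Reference; the `get_backend` helper and the stub bodies are illustrative, not the project's actual implementation:

```python
import os


class InferenceBackend:
    """Unified interface for all inference backends."""

    def generate(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError


class VLLMBackend(InferenceBackend):
    """High-performance vLLM engine (body omitted in this sketch)."""

    def generate(self, prompt: str, **kwargs) -> str:
        ...


class TransformersBackend(InferenceBackend):
    """Transformers fallback for compatibility (body omitted in this sketch)."""

    def generate(self, prompt: str, **kwargs) -> str:
        ...


def get_backend() -> InferenceBackend:
    """Select the backend from the environment; `get_backend` is an illustrative name."""
    if os.getenv("USE_VLLM", "true").lower() == "true":
        return VLLMBackend()
    return TransformersBackend()
```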
### Platform-Specific Configurations

#### HuggingFace Spaces (L40 GPU - 48GB VRAM)
```python
VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,     # Conservative (36GB of 48GB)
    "max_model_len": 2048,              # HF-optimized
    "enforce_eager": True,              # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,  # No custom kernels
    "dtype": "bfloat16",
}
```
Rationale: HuggingFace Spaces require conservative settings for stability and compatibility.
#### Scaleway L40S (48GB VRAM)
```python
VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,      # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,               # Full context length
    "enforce_eager": False,              # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
```
Rationale: Dedicated instances can use full optimizations for maximum performance.
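As a rough sketch of how these dictionaries map onto the engine: each key corresponds to a keyword argument of `vllm.LLM`. The `build_engine` helper and the model id below are illustrative assumptions, not the project's actual code:

```python
import os

from vllm import LLM, SamplingParams

# Platform-specific settings from the two blocks above.
VLLM_CONFIG_HF = {"gpu_memory_utilization": 0.75, "max_model_len": 2048,
                  "enforce_eager": True, "disable_custom_all_reduce": True,
                  "dtype": "bfloat16"}
VLLM_CONFIG_SCW = {"gpu_memory_utilization": 0.85, "max_model_len": 4096,
                   "enforce_eager": False, "disable_custom_all_reduce": False,
                   "dtype": "bfloat16"}


def build_engine(model_id: str) -> LLM:
    """Pick the config for the current platform and build the vLLM engine."""
    config = VLLM_CONFIG_SCW if os.getenv("DEPLOYMENT_ENV") == "scaleway" else VLLM_CONFIG_HF
    # Every key in the config dict is a vllm.LLM keyword argument.
    return LLM(model=model_id, **config)


if __name__ == "__main__":
    llm = build_engine("meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model id
    outputs = llm.generate(["What is EBITDA?"], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)
```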
## Deployment

### HuggingFace Spaces
Requirements:
- Dockerfile with `git` installed (for pip install from GitHub)
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=huggingface`, `USE_VLLM=true`
Current Status:
- ✅ Fully operational with vLLM
- ✅ L40 GPU (48GB VRAM)
- ✅ Eager mode for stability
- ✅ All endpoints functional
### Scaleway L40S
Requirements:
- NVIDIA CUDA base image (`nvidia/cuda:12.6.3-runtime-ubuntu22.04`)
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=scaleway`, `USE_VLLM=true`
Current Status:
- ✅ Ready for deployment
- ✅ Full CUDA graph optimizations
- ✅ Maximum performance configuration
## API Endpoints

### Standard Endpoints
- `POST /inference` - Standard inference with vLLM backend
- `GET /health` - Health check with backend information
- `GET /backend` - Backend configuration details
- `GET /models` - List available models
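A quick way to exercise these endpoints, as a sketch only: the base URL is a placeholder, and the `/inference` request body (`prompt`, `max_tokens`) is an assumed schema rather than a documented contract.

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder; use your Space or instance URL

# Health check: the response is expected to include backend information.
print(requests.get(f"{BASE_URL}/health", timeout=30).json())

# Standard inference: the payload keys below are assumptions about this API.
resp = requests.post(
    f"{BASE_URL}/inference",
    json={"prompt": "Summarize the key points of IFRS 9.", "max_tokens": 128},
    timeout=120,
)
print(resp.json())
```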
### OpenAI-Compatible Endpoints
- `POST /v1/chat/completions` - OpenAI chat completion format
- `POST /v1/completions` - OpenAI text completion format
- `GET /v1/models` - List models in OpenAI format
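Because these endpoints follow the OpenAI format, the standard `openai` Python client can be pointed at the API; the base URL, API key, and model name below are illustrative assumptions.

```python
from openai import OpenAI

# Point the stock OpenAI client at the vLLM-backed API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

print([m.id for m in client.models.list().data])  # GET /v1/models

chat = client.chat.completions.create(
    model="llama3.1-8b",  # assumed to match MODEL_NAME below
    messages=[{"role": "user", "content": "Explain duration risk in one sentence."}],
    max_tokens=64,
)
print(chat.choices[0].message.content)
```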
## Performance Metrics

### HuggingFace Spaces (L40 GPU)
- GPU Memory: 36GB utilized (75% of 48GB)
- KV Cache: 139,968 tokens
- Max Concurrency: 68.34x for 2,048 token requests
- Model Load Time: ~27 seconds
- Inference Speed: Fast even with eager mode (CUDA graphs disabled)
## Benefits Over Transformers Backend
- Memory Efficiency: 30-40% better GPU utilization
- Throughput: Higher concurrent request handling
- Batching: Continuous batching for multiple requests (see the sketch after this list)
- API Compatibility: OpenAI-compatible endpoints
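To illustrate the throughput and batching benefit, here is a hedged sketch that fires several requests concurrently at the OpenAI-compatible completions endpoint; the URL and model name are placeholders, and the response parsing assumes the standard OpenAI completions shape.

```python
import concurrent.futures

import requests

BASE_URL = "http://localhost:8000"  # placeholder URL
PROMPTS = ["Define the CET1 ratio.", "What is a credit default swap?", "Explain Value at Risk."]


def ask(prompt: str) -> str:
    # Standard OpenAI completions payload; vLLM batches concurrent requests server-side.
    r = requests.post(
        f"{BASE_URL}/v1/completions",
        json={"model": "llama3.1-8b", "prompt": prompt, "max_tokens": 64},
        timeout=120,
    )
    return r.json()["choices"][0]["text"]


# Fire the requests concurrently; continuous batching lets the engine
# process them together instead of strictly one at a time.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
    for answer in pool.map(ask, PROMPTS):
        print(answer.strip())
```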
## Troubleshooting

### Common Issues
1. Build Errors (HuggingFace)
   - Issue: Missing `git` in the Dockerfile
   - Solution: Add `git` to the `apt-get install` line in the Dockerfile

2. CUDA Compilation Errors
   - Issue: Attempting to build vLLM from source without a compiler
   - Solution: Use the official pre-compiled wheels (`vllm>=0.2.0`)

3. Memory Issues
   - Issue: OOM errors on model load
   - Solution: Reduce `gpu_memory_utilization` or `max_model_len`

4. ModelInfo Attribute Errors
   - Issue: Using `.get()` on ModelInfo objects
   - Solution: Use `getattr()` instead of `.get()` (see the sketch below)
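A short illustration of that last fix, assuming the ModelInfo objects come from `huggingface_hub` (the repo id is just an example):

```python
from huggingface_hub import HfApi

# ModelInfo is a plain Python object, not a dict, so .get() raises AttributeError.
info = HfApi().model_info("gpt2")  # illustrative repo id

# Wrong: info.get("pipeline_tag")  -> AttributeError
# Right: getattr() with a default instead.
pipeline_tag = getattr(info, "pipeline_tag", None)
print(pipeline_tag)
```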
## Configuration Reference

### Environment Variables
```bash
# Deployment configuration
DEPLOYMENT_ENV=huggingface   # or 'scaleway'
USE_VLLM=true

# Model selection
MODEL_NAME=llama3.1-8b       # Default model

# Storage
HF_HOME=/data/.huggingface

# Authentication
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_token
```
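For completeness, a sketch of how the application might read these variables at startup; the variable names match the reference above, while the defaults are assumptions:

```python
import os

DEPLOYMENT_ENV = os.getenv("DEPLOYMENT_ENV", "huggingface")  # or "scaleway"
USE_VLLM = os.getenv("USE_VLLM", "true").lower() == "true"
MODEL_NAME = os.getenv("MODEL_NAME", "llama3.1-8b")

# Cache model weights on persistent storage.
os.environ.setdefault("HF_HOME", "/data/.huggingface")

print(DEPLOYMENT_ENV, USE_VLLM, MODEL_NAME)
```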
### Requirements Files

- `requirements.txt` - HuggingFace (default, with official vLLM)
- `requirements-hf.txt` - HuggingFace-specific
- `requirements-scaleway.txt` - Scaleway-specific
## Future Enhancements
- Implement streaming responses
- Add request queueing and rate limiting
- Optimize KV cache settings per model
- Add metrics and monitoring endpoints
- Support for multi-GPU setups
Last Updated: October 4, 2025
Version: 24.1.0
Status: Production Ready ✅