
L40 GPU Limitations and Model Compatibility

🚨 Important: L40 GPU Memory Constraints

The NVIDIA L40 GPU (48GB VRAM) available on HuggingFace has specific limitations when running large language models with vLLM. This document outlines which models work and which do not.

✅ Compatible Models (Recommended)

8B Parameter Models

  • Llama 3.1 8B Financial - ✅ Recommended
  • Qwen 3 8B Financial - ✅ Recommended

Memory Usage: ~24-28GB total (model weights + KV caches + buffers)
Performance: Excellent inference speed and quality

Smaller Models

  • Fin-Pythia 1.4B Financial - ✅ Works perfectly

Memory Usage: ~6-8GB total
Performance: Very fast inference

❌ Incompatible Models

12B+ Parameter Models

  • Gemma 3 12B Financial - ❌ Too large for L40
  • Llama 3.1 70B Financial - ❌ Too large for L40

πŸ” Technical Analysis

Why 12B+ Models Fail

  1. Model Weights: Load successfully (~22GB for Gemma 12B)
  2. KV Cache Allocation: Fails during vLLM engine initialization
  3. Memory Requirements: Need ~45-50GB total (exceeds 48GB VRAM)
  4. Error: EngineCore failed to start during determine_available_memory()

Memory Breakdown (Gemma 12B)

Model weights:        ~22GB ✅ (loads successfully)
KV caches:           ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:        ~48GB (exceeds usable L40 VRAM)

Memory Breakdown (8B Models)

Model weights:        ~16GB ✅
KV caches:           ~8GB  ✅
Inference buffers:   ~4GB  ✅
System overhead:     ~2GB  ✅
Total used:          ~30GB (fits comfortably)
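
The arithmetic behind these breakdowns can be reproduced with a rough back-of-the-envelope estimate. The sketch below assumes Llama-3.1-8B-style shapes (32 layers, 8 KV heads, head dim 128) plus assumed batch and context budgets; it illustrates where the gigabytes come from, but it is not a substitute for vLLM's own profiling in determine_available_memory().

```python
# Back-of-the-envelope VRAM estimate (a sketch; the shape, batch, and context
# numbers below are assumptions, not this app's actual serving configuration).

def estimate_vram_gb(
    n_params_b: float,        # parameters, in billions
    n_layers: int = 32,       # Llama-3.1-8B-style depth
    n_kv_heads: int = 8,      # grouped-query attention KV heads
    head_dim: int = 128,
    max_seq_len: int = 8192,  # assumed context budget
    max_batch: int = 8,       # assumed concurrent sequences
    weight_bytes: int = 2,    # bf16/fp16 weights
    kv_bytes: int = 2,        # bf16/fp16 KV cache
    overhead_gb: float = 3.0, # CUDA context, activations, fragmentation
) -> float:
    weights = n_params_b * 1e9 * weight_bytes
    # K and V tensors for every layer, KV head, position, and sequence in the batch
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * max_seq_len * max_batch * kv_bytes
    return (weights + kv_cache) / 1e9 + overhead_gb

print(f"8B model: ~{estimate_vram_gb(8):.0f} GB total")  # ~28 GB -> fits on a 48GB L40
```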

🎯 Recommendations

For L40 GPU Deployment

  1. Use 8B models: Llama 3.1 8B or Qwen 3 8B
  2. Avoid 12B+ models: They will fail during initialization
  3. Test locally first: Verify model compatibility before deployment (a minimal smoke test is sketched after this list)
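
The simplest local check for step 3 is to load the model with vLLM and generate one completion; if the LLM constructor returns, the weights and KV cache fit on the GPU. A minimal sketch, using a placeholder model ID rather than an actual checkpoint name:

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; substitute the real 8B financial checkpoint.
MODEL_ID = "your-org/llama-3.1-8b-financial"

# If this constructor returns, weight loading and KV-cache allocation succeeded.
llm = LLM(model=MODEL_ID, max_model_len=4096, gpu_memory_utilization=0.90)

outputs = llm.generate(
    ["Summarize the main risks in a floating-rate bond portfolio."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```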

For Larger Models

  • Use A100 GPU: 80GB VRAM can handle 12B+ models
  • Use multiple GPUs: Distribute the model across multiple L40s (see the sketch after this list)
  • Use CPU inference: For testing (much slower)
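
For the multi-GPU route, vLLM can shard a model across devices with tensor parallelism. A minimal sketch, assuming two L40s are visible to the process and using a placeholder model ID:

```python
from vllm import LLM

# Placeholder model ID; requires 2 GPUs visible to the process.
llm = LLM(
    model="your-org/gemma-3-12b-financial",
    tensor_parallel_size=2,  # shard weights and KV cache across 2 x L40
)
```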

🔧 Configuration Notes

The application includes experimental configurations for 12B+ models with extremely conservative settings (mapped onto vLLM parameters in the sketch below):

  • gpu_memory_utilization: 0.50 (50% of 48GB = 24GB)
  • max_model_len: 256 (very short context)
  • max_num_batched_tokens: 256 (minimal batching)

⚠️ Warning: These settings are experimental and may still fail due to fundamental memory constraints.
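
For reference, the settings above map onto vLLM roughly as sketched below. This uses the offline LLM entry point with a placeholder model ID and is not the application's actual startup code.

```python
from vllm import LLM

# Experimental, extremely conservative settings for a 12B model on an L40.
# Placeholder model ID; initialization may still fail during KV-cache allocation.
llm = LLM(
    model="your-org/gemma-3-12b-financial",
    gpu_memory_utilization=0.50,  # cap vLLM at ~24GB of the L40's 48GB
    max_model_len=256,            # very short context
    max_num_batched_tokens=256,   # minimal batching
)
```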

📊 Performance Comparison

| Model | Parameters | L40 Status | Inference Speed | Quality |
|-------|------------|------------|-----------------|---------|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |

🚀 Best Practices

  1. Start with 8B models: They provide the best balance of performance and compatibility
  2. Monitor memory usage: Use the /health endpoint to check GPU memory (see the sketch after this list)
  3. Test model switching: Verify /load-model works with compatible models
  4. Document failures: Keep track of which models fail and why
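
A quick way to exercise steps 2 and 3 is sketched below. The base URL, request payload, and response handling are assumptions about this application's API, so adjust them to the real /health and /load-model contracts.

```python
import requests

# Assumed base URL for the deployed Space or a local server.
BASE_URL = "http://localhost:8000"

# Check GPU memory before switching models.
health = requests.get(f"{BASE_URL}/health", timeout=10)
print("health:", health.status_code, health.json())

# Ask the app to switch to a compatible 8B model.
# The payload shape is an assumption; match it to the real /load-model contract.
resp = requests.post(
    f"{BASE_URL}/load-model",
    json={"model": "llama-3.1-8b-financial"},
    timeout=600,
)
print("load-model:", resp.status_code, resp.json())
```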

🔗 Related Documentation