
L40 GPU Limitations and Model Compatibility

🚨 Important: L40 GPU Memory Constraints

The NVIDIA L40 GPU (48GB VRAM) available on HuggingFace has specific limitations when running large language models with vLLM. This document outlines which models work and which do not.

✅ Compatible Models (Recommended)

8B Parameter Models

  • Llama 3.1 8B Financial - ✅ Recommended
  • Qwen 3 8B Financial - ✅ Recommended

Memory Usage: ~24-28GB total (model weights + KV caches + buffers)
Performance: Excellent inference speed and quality

Smaller Models

  • Fin-Pythia 1.4B Financial - ✅ Works perfectly

Memory Usage: ~6-8GB total
Performance: Very fast inference

❌ Incompatible Models

12B+ Parameter Models

  • Gemma 3 12B Financial - ❌ Too large for L40
  • Llama 3.1 70B Financial - ❌ Too large for L40

πŸ” Technical Analysis

Why 12B+ Models Fail

  1. Model Weights: Load successfully (~22GB for Gemma 12B)
  2. KV Cache Allocation: Fails during vLLM engine initialization
  3. Memory Requirements: Need ~45-50GB total (exceeds 48GB VRAM)
  4. Error: EngineCore failed to start during determine_available_memory()

Memory Breakdown (Gemma 12B)

Model weights:        ~22GB ✅ (loads successfully)
KV caches:           ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:        ~48GB (exceeds usable L40 VRAM)

Memory Breakdown (8B Models)

Model weights:        ~16GB ✅
KV caches:           ~8GB  ✅
Inference buffers:   ~4GB  ✅
System overhead:     ~2GB  ✅
Total used:          ~30GB (fits comfortably)
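
The arithmetic behind these breakdowns can be reproduced with a rough back-of-the-envelope estimate. The sketch below assumes Llama-3.1-8B-style shapes (32 layers, 8 KV heads, head dim 128) plus assumed batch and context budgets; it illustrates where the gigabytes come from, but it is not a substitute for vLLM's own profiling in determine_available_memory().

```python
# Back-of-the-envelope VRAM estimate (a sketch; the shape, batch, and context
# numbers below are assumptions, not this app's actual serving configuration).

def estimate_vram_gb(
    n_params_b: float,        # parameters, in billions
    n_layers: int = 32,       # Llama-3.1-8B-style depth
    n_kv_heads: int = 8,      # grouped-query attention KV heads
    head_dim: int = 128,
    max_seq_len: int = 8192,  # assumed context budget
    max_batch: int = 8,       # assumed concurrent sequences
    weight_bytes: int = 2,    # bf16/fp16 weights
    kv_bytes: int = 2,        # bf16/fp16 KV cache
    overhead_gb: float = 3.0, # CUDA context, activations, fragmentation
) -> float:
    weights = n_params_b * 1e9 * weight_bytes
    # K and V tensors for every layer, KV head, position, and sequence in the batch
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * max_seq_len * max_batch * kv_bytes
    return (weights + kv_cache) / 1e9 + overhead_gb

print(f"8B model: ~{estimate_vram_gb(8):.0f} GB total")  # ~28 GB -> fits on a 48GB L40
```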

🎯 Recommendations

For L40 GPU Deployment

  1. Use 8B models: Llama 3.1 8B or Qwen 3 8B
  2. Avoid 12B+ models: They will fail during initialization
  3. Test locally first: Verify model compatibility before deployment (a minimal smoke test is sketched after this list)
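
The simplest local check for step 3 is to load the model with vLLM and generate one completion; if the LLM constructor returns, the weights and KV cache fit on the GPU. A minimal sketch, using a placeholder model ID rather than an actual checkpoint name:

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; substitute the real 8B financial checkpoint.
MODEL_ID = "your-org/llama-3.1-8b-financial"

# If this constructor returns, weight loading and KV-cache allocation succeeded.
llm = LLM(model=MODEL_ID, max_model_len=4096, gpu_memory_utilization=0.90)

outputs = llm.generate(
    ["Summarize the main risks in a floating-rate bond portfolio."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```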

For Larger Models

  • Use A100 GPU: 80GB VRAM can handle 12B+ models
  • Use multiple GPUs: Distribute the model across multiple L40s (see the sketch after this list)
  • Use CPU inference: For testing (much slower)
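
For the multi-GPU route, vLLM can shard a model across devices with tensor parallelism. A minimal sketch, assuming two L40s are visible to the process and using a placeholder model ID:

```python
from vllm import LLM

# Placeholder model ID; requires 2 GPUs visible to the process.
llm = LLM(
    model="your-org/gemma-3-12b-financial",
    tensor_parallel_size=2,  # shard weights and KV cache across 2 x L40
)
```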

🔧 Configuration Notes

The application includes experimental configurations for 12B+ models with extremely conservative settings (mapped onto vLLM parameters in the sketch below):

  • gpu_memory_utilization: 0.50 (50% of 48GB = 24GB)
  • max_model_len: 256 (very short context)
  • max_num_batched_tokens: 256 (minimal batching)

⚠️ Warning: These settings are experimental and may still fail due to fundamental memory constraints.
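
For reference, the settings above map onto vLLM roughly as sketched below. This uses the offline LLM entry point with a placeholder model ID and is not the application's actual startup code.

```python
from vllm import LLM

# Experimental, extremely conservative settings for a 12B model on an L40.
# Placeholder model ID; initialization may still fail during KV-cache allocation.
llm = LLM(
    model="your-org/gemma-3-12b-financial",
    gpu_memory_utilization=0.50,  # cap vLLM at ~24GB of the L40's 48GB
    max_model_len=256,            # very short context
    max_num_batched_tokens=256,   # minimal batching
)
```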

📊 Performance Comparison

| Model | Parameters | L40 Status | Inference Speed | Quality |
|-------|------------|------------|-----------------|---------|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |

🚀 Best Practices

  1. Start with 8B models: They provide the best balance of performance and compatibility
  2. Monitor memory usage: Use the /health endpoint to check GPU memory (see the sketch after this list)
  3. Test model switching: Verify /load-model works with compatible models
  4. Document failures: Keep track of which models fail and why
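
A quick way to exercise steps 2 and 3 is sketched below. The base URL, request payload, and response handling are assumptions about this application's API, so adjust them to the real /health and /load-model contracts.

```python
import requests

# Assumed base URL for the deployed Space or a local server.
BASE_URL = "http://localhost:8000"

# Check GPU memory before switching models.
health = requests.get(f"{BASE_URL}/health", timeout=10)
print("health:", health.status_code, health.json())

# Ask the app to switch to a compatible 8B model.
# The payload shape is an assumption; match it to the real /load-model contract.
resp = requests.post(
    f"{BASE_URL}/load-model",
    json={"model": "llama-3.1-8b-financial"},
    timeout=600,
)
print("load-model:", resp.status_code, resp.json())
```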

🔗 Related Documentation