# L40 GPU Limitations and Model Compatibility

## 🚨 Important: L40 GPU Memory Constraints
The HuggingFace L40 GPU (48GB VRAM) has specific limitations when running large language models with vLLM. This document outlines which models work and which don't.
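Before deploying, it can help to confirm how much VRAM is actually free on the device. Below is a minimal sketch, assuming PyTorch with CUDA support is installed and a GPU is visible to the process:

```python
# Report total and currently free VRAM on the first visible GPU.
# On an L40 the total should be close to the 48GB figure cited above.
import torch

props = torch.cuda.get_device_properties(0)
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"GPU:        {props.name}")
print(f"Total VRAM: {total_bytes / 1e9:.1f} GB")
print(f"Free VRAM:  {free_bytes / 1e9:.1f} GB")
```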
## ✅ Compatible Models (Recommended)

### 8B Parameter Models
- Llama 3.1 8B Financial - ✅ Recommended
- Qwen 3 8B Financial - ✅ Recommended

**Memory Usage:** ~24-28GB total (model weights + KV caches + buffers)
**Performance:** Excellent inference speed and quality
### Smaller Models
- Fin-Pythia 1.4B Financial - ✅ Works perfectly

**Memory Usage:** ~6-8GB total
**Performance:** Very fast inference
## ❌ Incompatible Models

### 12B+ Parameter Models
- Gemma 3 12B Financial - ❌ Too large for L40
- Llama 3.1 70B Financial - ❌ Too large for L40
## Technical Analysis

### Why 12B+ Models Fail
- Model Weights: Load successfully (~22GB for Gemma 12B)
- KV Cache Allocation: Fails during vLLM engine initialization
- Memory Requirements: Need ~45-50GB total (exceeds 48GB VRAM)
- Error: `EngineCore failed to start` during `determine_available_memory()`
### Memory Breakdown (Gemma 12B)
- Model weights: ~22GB ✅ (loads successfully)
- KV caches: ~15GB ❌ (allocation fails)
- Inference buffers: ~8GB ❌ (allocation fails)
- System overhead: ~3GB ❌ (allocation fails)
- Total needed: ~48GB (exceeds L40 capacity)
### Memory Breakdown (8B Models)
- Model weights: ~16GB ✅
- KV caches: ~8GB ✅
- Inference buffers: ~4GB ✅
- System overhead: ~2GB ✅
- Total used: ~30GB (fits comfortably)
## Recommendations

### For L40 GPU Deployment
- Use 8B models: Llama 3.1 8B or Qwen 3 8B
- Avoid 12B+ models: They will fail during initialization
- Test locally first: Verify model compatibility before deployment (see the sketch below)
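A quick way to test locally is to try bringing up the vLLM engine for the candidate model on a machine with a comparable GPU. Below is a minimal sketch, assuming the vLLM Python API; the model name and settings are placeholders, not the application's actual configuration:

```python
# Smoke test: initialize the vLLM engine and run one generation.
# If this fails with an out-of-memory / EngineCore error locally,
# the model will not start on the L40 either.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)
outputs = llm.generate(
    ["Summarize this quarter's earnings in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```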
### For Larger Models
- Use A100 GPU: 80GB VRAM can handle 12B+ models
- Use multiple GPUs: Distribute the model across multiple L40s (see the tensor-parallel sketch below)
- Use CPU inference: For testing (much slower)
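For the multi-GPU option, vLLM can shard a model across devices with tensor parallelism. A sketch assuming two L40s are visible to the process; the model name is a placeholder and this configuration has not been validated here:

```python
# Distribute a 12B-class model across two L40s via tensor parallelism.
# Untested placeholder configuration.
from vllm import LLM

llm = LLM(
    model="google/gemma-3-12b-it",  # placeholder 12B-class checkpoint
    tensor_parallel_size=2,         # shard weights and KV cache across 2 GPUs
    gpu_memory_utilization=0.85,
)
```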
## Configuration Notes
The application includes experimental configurations for 12B+ models with extremely conservative settings:
- `gpu_memory_utilization: 0.50` (50% of 48GB = 24GB)
- `max_model_len: 256` (very short context)
- `max_num_batched_tokens: 256` (minimal batching)

⚠️ Warning: These settings are experimental and may still fail due to fundamental memory constraints.
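For reference, this is roughly how those experimental settings would map onto vLLM engine arguments. This is a sketch; the application's actual wiring may differ, and the model name is a placeholder:

```python
# Extremely conservative engine settings for a 12B-class model on the L40.
# Mirrors the experimental values listed above; may still fail at startup.
from vllm import LLM

llm = LLM(
    model="google/gemma-3-12b-it",  # placeholder 12B-class checkpoint
    gpu_memory_utilization=0.50,    # 50% of 48GB = 24GB
    max_model_len=256,              # very short context
    max_num_batched_tokens=256,     # minimal batching
)
```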
## Performance Comparison
| Model | Parameters | L40 Status | Inference Speed | Quality |
|---|---|---|---|---|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |
## Best Practices
- Start with 8B models: They provide the best balance of performance and compatibility
- Monitor memory usage: Use the `/health` endpoint to check GPU memory
- Test model switching: Verify `/load-model` works with compatible models (see the sketch after this list)
- Document failures: Keep track of which models fail and why
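The `/health` and `/load-model` checks can be scripted. A minimal sketch, assuming the Space exposes these endpoints over HTTP; the base URL and the request/response shapes are assumptions, not the documented API:

```python
# Poll /health for GPU memory and ask /load-model to switch models.
# Endpoint paths come from this document; payload fields are assumed.
import requests

BASE_URL = "https://<your-space>.hf.space"  # placeholder Space URL

health = requests.get(f"{BASE_URL}/health", timeout=30)
print(health.json())  # expected to include GPU memory usage

resp = requests.post(
    f"{BASE_URL}/load-model",
    json={"model": "llama-3.1-8b-financial"},  # assumed payload shape
    timeout=600,
)
print(resp.status_code, resp.json())
```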
## Related Documentation
- README.md - Main project documentation
- README_HF_SPACE.md - HuggingFace Space setup
- DEPLOYMENT_SUCCESS_SUMMARY.md - Deployment results