# Graceful Shutdown & Sleep Mode Implementation

**Version:** 24.1.1
**Date:** October 4, 2025
**Status:** ✅ Deployed to HuggingFace L40 Space
## 🎯 Overview

Implemented graceful shutdown and vLLM sleep mode support to handle HuggingFace Spaces sleep/wake cycles without the `EngineCore_DP0 died unexpectedly` error.
## 🛠️ Implementation Details

### 1. FastAPI Shutdown Event Handler

```python
@app.on_event("shutdown")
async def shutdown_event():
    """Gracefully shut down the application."""
    global inference_backend
    logger.info("🛑 Starting graceful shutdown...")
    try:
        if inference_backend:
            logger.info(f"🧹 Cleaning up {inference_backend.backend_type} backend...")
            inference_backend.cleanup()
            logger.info("✅ Backend cleanup completed")

        # Additional cleanup for global variables
        cleanup_model_memory()
        logger.info("✅ Global memory cleanup completed")

        logger.info("✅ Graceful shutdown completed successfully")
    except Exception as e:
        logger.error(f"❌ Error during shutdown: {e}")
        # Don't raise the exception to avoid preventing shutdown
```
**Key Features:**
- Calls backend-specific cleanup methods
- Clears GPU memory and runs garbage collection
- Handles errors gracefully without blocking shutdown
- Uses FastAPI's native shutdown event (no signal handlers)
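Note that newer FastAPI versions deprecate `@app.on_event` in favor of a lifespan context manager. A minimal sketch of the equivalent wiring, assuming the same `inference_backend` and `cleanup_model_memory` globals used above:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup logic would run here, before the yield
    yield
    # Shutdown logic runs after the yield, mirroring shutdown_event() above
    if inference_backend:
        inference_backend.cleanup()
    cleanup_model_memory()

app = FastAPI(lifespan=lifespan)
```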
### 2. vLLM Backend Cleanup

```python
def cleanup(self) -> None:
    """Clean up vLLM resources gracefully."""
    try:
        if self.engine:
            logger.info("🧹 Shutting down vLLM engine...")
            del self.engine
            self.engine = None
            logger.info("✅ vLLM engine reference cleared")

        # Clear CUDA cache
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            logger.info("✅ CUDA cache cleared")

        # Force garbage collection
        import gc
        gc.collect()
        logger.info("✅ Garbage collection completed")
    except Exception as e:
        logger.error(f"❌ Error during vLLM cleanup: {e}")
```
**Key Features:**
- Properly deletes vLLM engine references
- Clears CUDA cache to free GPU memory
- Forces garbage collection
- Detailed logging for debugging
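To verify that cleanup actually frees GPU memory, a small helper (hypothetical, not part of the deployed app) can log PyTorch allocator stats before and after:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Log PyTorch CUDA allocator usage (memory held by other processes is not counted)."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        logger.info(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

log_gpu_memory("before cleanup")
inference_backend.cleanup()
log_gpu_memory("after cleanup")
```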
### 3. vLLM Sleep Mode Support

```python
def sleep(self) -> bool:
    """Put the vLLM engine into sleep mode (for HuggingFace Spaces)."""
    try:
        if self.engine and hasattr(self.engine, 'sleep'):
            logger.info("😴 Putting vLLM engine to sleep...")
            self.engine.sleep()
            logger.info("✅ vLLM engine is now sleeping (GPU memory released)")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support sleep mode or not loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error putting vLLM to sleep (non-critical): {e}")
        return False

def wake(self) -> bool:
    """Wake up the vLLM engine from sleep mode."""
    try:
        if self.engine and hasattr(self.engine, 'wake'):
            logger.info("🌅 Waking up vLLM engine...")
            self.engine.wake()
            logger.info("✅ vLLM engine is now awake")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support wake mode or not loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error waking up vLLM (non-critical): {e}")
        return False
```
**Key Features:**
- Uses vLLM's native sleep mode API (if available)
- Releases GPU memory while keeping the model in CPU RAM
- Much faster wake-up than a full model reload
- Graceful fallback if sleep mode is not supported
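For reference, recent vLLM versions expose sleep mode on the `LLM` class, and it must be enabled when the engine is constructed. Method names differ across versions (current releases use `sleep(level=...)` and `wake_up()`), which is why the wrapper above probes with `hasattr` first. A hedged sketch, with the model name as a placeholder:

```python
from vllm import LLM

# Sleep mode must be enabled at engine construction time
llm = LLM(model="your-model-id", enable_sleep_mode=True)

llm.sleep(level=1)  # level 1 offloads weights to CPU RAM and frees the KV cache
# ... GPU memory stays released while the Space idles ...
llm.wake_up()       # restores weights to the GPU, far faster than a full reload
```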
### 4. Manual Control Endpoints

#### Sleep Endpoint

`POST /sleep`

Puts the backend into sleep mode, releasing GPU memory.

**Response:**

```json
{
  "message": "Backend put to sleep successfully",
  "status": "sleeping",
  "backend": "vllm",
  "note": "GPU memory released, ready for HuggingFace Space sleep"
}
```

#### Wake Endpoint

`POST /wake`

Wakes up the backend from sleep mode.

**Response:**

```json
{
  "message": "Backend woken up successfully",
  "status": "awake",
  "backend": "vllm",
  "note": "Ready for inference"
}
```
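A minimal sketch of how these endpoints could be wired up in `app.py` (error handling simplified; the response bodies follow the examples above):

```python
from fastapi import HTTPException

@app.post("/sleep")
async def sleep_backend():
    if inference_backend is None:
        raise HTTPException(status_code=503, detail="Backend not initialized")
    if not inference_backend.sleep():
        raise HTTPException(status_code=400, detail="Backend does not support sleep mode")
    return {
        "message": "Backend put to sleep successfully",
        "status": "sleeping",
        "backend": inference_backend.backend_type,
        "note": "GPU memory released, ready for HuggingFace Space sleep",
    }

@app.post("/wake")
async def wake_backend():
    if inference_backend is None:
        raise HTTPException(status_code=503, detail="Backend not initialized")
    if not inference_backend.wake():
        raise HTTPException(status_code=400, detail="Backend does not support wake mode")
    return {
        "message": "Backend woken up successfully",
        "status": "awake",
        "backend": inference_backend.backend_type,
        "note": "Ready for inference",
    }
```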
### 5. Startup Wake-Up Check

```python
if inference_backend.backend_type == "vllm":
    logger.info("🌅 Checking if vLLM needs to wake up from sleep...")
    try:
        wake_success = inference_backend.wake()
        if wake_success:
            logger.info("✅ vLLM wake-up successful")
        else:
            logger.info("ℹ️ vLLM wake-up not needed (fresh startup)")
    except Exception as e:
        logger.info(f"ℹ️ vLLM wake-up check completed (normal on fresh startup): {e}")
```
**Key Features:**
- Automatically checks if vLLM needs to wake up on startup
- Handles both fresh starts and wake-ups from sleep
- Non-blocking: startup continues even if the wake fails (see the sketch below for context)
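For context, this check would live inside the startup handler, after the backend is constructed (a sketch; `create_inference_backend` is a hypothetical stand-in for the app's actual initialization):

```python
@app.on_event("startup")
async def startup_event():
    global inference_backend
    inference_backend = create_inference_backend()  # hypothetical factory
    if inference_backend.backend_type == "vllm":
        try:
            if inference_backend.wake():
                logger.info("✅ vLLM wake-up successful")
        except Exception as e:
            logger.info(f"ℹ️ vLLM wake-up check completed (normal on fresh startup): {e}")
```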
## 🔄 How It Works with HuggingFace Spaces

### Scenario 1: Space Going to Sleep
1. HuggingFace Spaces sends a shutdown signal
2. FastAPI's shutdown event handler is triggered
3. `inference_backend.cleanup()` is called
4. The vLLM engine is properly shut down
5. GPU memory is cleared
6. The Space can sleep without errors

### Scenario 2: Space Waking Up
1. A user accesses the Space
2. FastAPI starts up normally
3. The startup event calls `inference_backend.wake()`
4. vLLM restores the model to the GPU (if applicable)
5. Ready for inference

### Scenario 3: Manual Sleep/Wake
1. Call `POST /sleep` to manually put the backend to sleep
2. GPU memory is released
3. Call `POST /wake` to restore the backend
4. Resume inference
## 📊 Expected Behavior

### Before Implementation

```
ERROR 10-04 10:17:40 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
```

### After Implementation

```
INFO:app:🛑 Starting graceful shutdown...
INFO:app:🧹 Cleaning up vllm backend...
INFO:app:✅ vLLM engine reference cleared
INFO:app:✅ CUDA cache cleared
INFO:app:✅ Garbage collection completed
INFO:app:✅ Backend cleanup completed
INFO:app:✅ Global memory cleanup completed
INFO:app:✅ Graceful shutdown completed successfully
```
## 🔧 Design Decisions

### Why No Signal Handlers?

Initially implemented custom signal handlers (SIGTERM, SIGINT), but removed them because:
- **HuggingFace Infrastructure:** HuggingFace Spaces has its own signal handling infrastructure
- **Conflicts:** Custom signal handlers can conflict with the platform's shutdown process
- **FastAPI Native:** FastAPI's `@app.on_event("shutdown")` is already properly integrated
- **Simplicity:** Fewer moving parts = more reliable

### Why Separate Sleep/Wake from Shutdown?

- **Different Use Cases:** Sleep is for a temporary pause; shutdown is for termination
- **Performance:** Sleep mode is faster to resume than a full restart
- **Flexibility:** Manual control allows testing and optimization
- **Non-Intrusive:** Sleep/wake are optional features that don't affect core functionality
## 🐛 Issues Fixed

### Issue 1: Undefined Variable
**Error:** `NameError: name 'deployment_env' is not defined`
**Fix:** Removed the environment check in the wake-up call (safe for all backends)

### Issue 2: Signal Handler Conflicts
**Error:** Runtime errors on Space startup
**Fix:** Removed custom signal handlers; rely on FastAPI's native events

### Issue 3: Logger Initialization Order
**Error:** Logger used before definition
**Fix:** Moved the signal import after logger setup
## 📈 Benefits

- **No More Unexpected Deaths:** vLLM engine shuts down cleanly
- **Faster Wake-Up:** Sleep mode preserves the model in CPU RAM
- **Better Resource Management:** Proper GPU memory cleanup
- **Manual Control:** API endpoints for testing and debugging
- **Production Ready:** Handles sleep/wake edge cases gracefully
## 🧪 Testing

### Test Graceful Shutdown

```bash
# Check health before shutdown
curl https://your-api-url.hf.space/health

# Wait for the Space to go to sleep (or manually stop it)
# Check logs for graceful shutdown messages
```

### Test Sleep/Wake

```bash
# Put to sleep
curl -X POST https://your-api-url.hf.space/sleep

# Check backend status
curl https://your-api-url.hf.space/backend

# Wake up
curl -X POST https://your-api-url.hf.space/wake

# Test inference
curl -X POST https://your-api-url.hf.space/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is financial risk?", "max_new_tokens": 50}'
```
## 🔮 Future Improvements

- **Automatic Sleep:** Auto-sleep after X minutes of inactivity (see the sketch below)
- **Sleep Metrics:** Track sleep/wake cycles and performance
- **Progressive Wake:** Warm up the model gradually
- **Health Check Integration:** Report sleep status in the health endpoint
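As an illustration of the first item, here is a hedged sketch of inactivity-based auto-sleep using an asyncio background task; the threshold and the `last_request_time` bookkeeping are assumptions, not part of the current app:

```python
import asyncio
import time

IDLE_SLEEP_SECONDS = 15 * 60  # hypothetical inactivity threshold
last_request_time = time.monotonic()

async def auto_sleep_loop():
    """Background task: put the backend to sleep after a period of inactivity."""
    while True:
        await asyncio.sleep(60)  # poll once a minute
        idle = time.monotonic() - last_request_time
        if idle > IDLE_SLEEP_SECONDS and inference_backend:
            inference_backend.sleep()

# Launched from the startup handler:
#   asyncio.create_task(auto_sleep_loop())
# Each request handler would refresh last_request_time on arrival.
```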
## ✅ Status
- FastAPI shutdown event handler
- vLLM cleanup method with logging
- vLLM sleep/wake methods
- Manual sleep/wake API endpoints
- Startup wake-up check
- Remove signal handlers (simplification)
- Fix undefined variable bug
- Deploy to HuggingFace Space
- Test on live Space
- Monitor for 24 hours
- Document in main README
## 📁 Related Files

- `app.py`: Main application with shutdown/sleep implementation
- `PROJECT_RULES.md`: Updated with vLLM configuration
- `docs/VLLM_INTEGRATION.md`: vLLM backend documentation
- `README.md`: Project overview and architecture