# Graceful Shutdown & Sleep Mode Implementation **Version**: 24.1.1 **Date**: October 4, 2025 **Status**: โœ… Deployed to HuggingFace L40 Space ## ๐ŸŽฏ Overview Implemented graceful shutdown and vLLM sleep mode support to handle HuggingFace Spaces sleep/wake cycles without the `EngineCore_DP0 died unexpectedly` error. ## ๐Ÿ› ๏ธ Implementation Details ### 1. **FastAPI Shutdown Event Handler** ```python @app.on_event("shutdown") async def shutdown_event(): """Gracefully shutdown the application.""" global inference_backend logger.info("๐Ÿ›‘ Starting graceful shutdown...") try: if inference_backend: logger.info(f"๐Ÿงน Cleaning up {inference_backend.backend_type} backend...") inference_backend.cleanup() logger.info("โœ… Backend cleanup completed") # Additional cleanup for global variables cleanup_model_memory() logger.info("โœ… Global memory cleanup completed") logger.info("โœ… Graceful shutdown completed successfully") except Exception as e: logger.error(f"โŒ Error during shutdown: {e}") # Don't raise the exception to avoid preventing shutdown ``` **Key Features**: - Calls backend-specific cleanup methods - Clears GPU memory and runs garbage collection - Handles errors gracefully without blocking shutdown - Uses FastAPI's native shutdown event (no signal handlers) ### 2. **vLLM Backend Cleanup** ```python def cleanup(self) -> None: """Clean up vLLM resources gracefully.""" try: if self.engine: logger.info("๐Ÿงน Shutting down vLLM engine...") del self.engine self.engine = None logger.info("โœ… vLLM engine reference cleared") # Clear CUDA cache import torch if torch.cuda.is_available(): torch.cuda.empty_cache() logger.info("โœ… CUDA cache cleared") # Force garbage collection import gc gc.collect() logger.info("โœ… Garbage collection completed") except Exception as e: logger.error(f"โŒ Error during vLLM cleanup: {e}") ``` **Key Features**: - Properly deletes vLLM engine references - Clears CUDA cache to free GPU memory - Forces garbage collection - Detailed logging for debugging ### 3. **vLLM Sleep Mode Support** ```python def sleep(self) -> bool: """Put vLLM engine into sleep mode (for HuggingFace Spaces).""" try: if self.engine and hasattr(self.engine, 'sleep'): logger.info("๐Ÿ˜ด Putting vLLM engine to sleep...") self.engine.sleep() logger.info("โœ… vLLM engine is now sleeping (GPU memory released)") return True else: logger.info("โ„น๏ธ vLLM engine doesn't support sleep mode or not loaded") return False except Exception as e: logger.warning(f"โš ๏ธ Error putting vLLM to sleep (non-critical): {e}") return False def wake(self) -> bool: """Wake up vLLM engine from sleep mode.""" try: if self.engine and hasattr(self.engine, 'wake'): logger.info("๐ŸŒ… Waking up vLLM engine...") self.engine.wake() logger.info("โœ… vLLM engine is now awake") return True else: logger.info("โ„น๏ธ vLLM engine doesn't support wake mode or not loaded") return False except Exception as e: logger.warning(f"โš ๏ธ Error waking up vLLM (non-critical): {e}") return False ``` **Key Features**: - Uses vLLM's native sleep mode API (if available) - Releases GPU memory while keeping model in CPU RAM - Much faster wake-up than full model reload - Graceful fallback if sleep mode not supported ### 4. **Manual Control Endpoints** #### Sleep Endpoint ``` POST /sleep ``` Puts the backend into sleep mode, releasing GPU memory. **Response**: ```json { "message": "Backend put to sleep successfully", "status": "sleeping", "backend": "vllm", "note": "GPU memory released, ready for HuggingFace Space sleep" } ``` #### Wake Endpoint ``` POST /wake ``` Wakes up the backend from sleep mode. **Response**: ```json { "message": "Backend woken up successfully", "status": "awake", "backend": "vllm", "note": "Ready for inference" } ``` ### 5. **Startup Wake-Up Check** ```python if inference_backend.backend_type == "vllm": logger.info("๐ŸŒ… Checking if vLLM needs to wake up from sleep...") try: wake_success = inference_backend.wake() if wake_success: logger.info("โœ… vLLM wake-up successful") else: logger.info("โ„น๏ธ vLLM wake-up not needed (fresh startup)") except Exception as e: logger.info(f"โ„น๏ธ vLLM wake-up check completed (normal on fresh startup): {e}") ``` **Key Features**: - Automatically checks if vLLM needs to wake up on startup - Handles both fresh starts and wake-ups from sleep - Non-blocking - continues startup even if wake fails ## ๐Ÿš€ How It Works with HuggingFace Spaces ### Scenario 1: Space Going to Sleep 1. HuggingFace Spaces sends shutdown signal 2. FastAPI's shutdown event handler is triggered 3. `inference_backend.cleanup()` is called 4. vLLM engine is properly shut down 5. GPU memory is cleared 6. Space can sleep without errors ### Scenario 2: Space Waking Up 1. User accesses the Space 2. FastAPI starts up normally 3. Startup event calls `inference_backend.wake()` 4. vLLM restores model to GPU (if applicable) 5. Ready for inference ### Scenario 3: Manual Sleep/Wake 1. Call `POST /sleep` to manually put backend to sleep 2. GPU memory is released 3. Call `POST /wake` to restore backend 4. Resume inference ## ๐Ÿ“Š Expected Behavior ### Before Implementation ``` ERROR 10-04 10:17:40 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client. ``` ### After Implementation ``` INFO:app:๐Ÿ›‘ Starting graceful shutdown... INFO:app:๐Ÿงน Cleaning up vllm backend... INFO:app:โœ… vLLM engine reference cleared INFO:app:โœ… CUDA cache cleared INFO:app:โœ… Garbage collection completed INFO:app:โœ… Backend cleanup completed INFO:app:โœ… Global memory cleanup completed INFO:app:โœ… Graceful shutdown completed successfully ``` ## ๐Ÿ”ง Design Decisions ### Why No Signal Handlers? Initially implemented custom signal handlers (SIGTERM, SIGINT), but removed them because: 1. **HuggingFace Infrastructure**: HuggingFace Spaces has its own signal handling infrastructure 2. **Conflicts**: Custom signal handlers can conflict with the platform's shutdown process 3. **FastAPI Native**: FastAPI's `@app.on_event("shutdown")` is already properly integrated 4. **Simplicity**: Fewer moving parts = more reliable ### Why Separate Sleep/Wake from Shutdown? 1. **Different Use Cases**: Sleep is for temporary pause, shutdown is for termination 2. **Performance**: Sleep mode is faster to resume than full restart 3. **Flexibility**: Manual control allows testing and optimization 4. **Non-Intrusive**: Sleep/wake are optional features that don't affect core functionality ## ๐Ÿ› Issues Fixed ### Issue 1: Undefined Variable **Error**: `NameError: name 'deployment_env' is not defined` **Fix**: Removed environment check in wake-up call - safe for all backends ### Issue 2: Signal Handler Conflicts **Error**: Runtime errors on Space startup **Fix**: Removed custom signal handlers, rely on FastAPI native events ### Issue 3: Logger Initialization Order **Error**: Logger used before definition **Fix**: Moved signal import after logger setup ## ๐Ÿ“ˆ Benefits 1. **No More Unexpected Deaths**: vLLM engine shuts down cleanly 2. **Faster Wake-Up**: Sleep mode preserves model in CPU RAM 3. **Better Resource Management**: Proper GPU memory cleanup 4. **Manual Control**: API endpoints for testing and debugging 5. **Production Ready**: Handles all edge cases gracefully ## ๐Ÿงช Testing ### Test Graceful Shutdown ```bash # Check health before shutdown curl https://your-api-url.hf.space/health # Wait for Space to go to sleep (or manually stop it) # Check logs for graceful shutdown messages ``` ### Test Sleep/Wake ```bash # Put to sleep curl -X POST https://your-api-url.hf.space/sleep # Check backend status curl https://your-api-url.hf.space/backend # Wake up curl -X POST https://your-api-url.hf.space/wake # Test inference curl -X POST https://your-api-url.hf.space/inference \ -H "Content-Type: application/json" \ -d '{"prompt": "What is financial risk?", "max_new_tokens": 50}' ``` ## ๐Ÿ“ Future Improvements 1. **Automatic Sleep**: Auto-sleep after X minutes of inactivity 2. **Sleep Metrics**: Track sleep/wake cycles and performance 3. **Progressive Wake**: Warm up model gradually 4. **Health Check Integration**: Report sleep status in health endpoint ## โœ… Status - [x] FastAPI shutdown event handler - [x] vLLM cleanup method with logging - [x] vLLM sleep/wake methods - [x] Manual sleep/wake API endpoints - [x] Startup wake-up check - [x] Remove signal handlers (simplification) - [x] Fix undefined variable bug - [x] Deploy to HuggingFace Space - [ ] Test on live Space - [ ] Monitor for 24 hours - [ ] Document in main README ## ๐Ÿ”— Related Files - `app.py`: Main application with shutdown/sleep implementation - `PROJECT_RULES.md`: Updated with vLLM configuration - `docs/VLLM_INTEGRATION.md`: vLLM backend documentation - `README.md`: Project overview and architecture ## ๐Ÿ“š References - [vLLM Sleep Mode Documentation](https://docs.vllm.ai/en/latest/features/sleep_mode.html) - [FastAPI Lifecycle Events](https://fastapi.tiangolo.com/advanced/events/) - [HuggingFace Spaces Docker](https://huggingface.co/docs/hub/spaces-sdks-docker)