
Graceful Shutdown & Sleep Mode Implementation

Version: 24.1.1
Date: October 4, 2025
Status: ✅ Deployed to HuggingFace L40 Space

🎯 Overview

Implemented graceful shutdown and vLLM sleep mode support so the API survives HuggingFace Spaces sleep/wake cycles without hitting the "EngineCore_DP0 died unexpectedly" error.

πŸ› οΈ Implementation Details

1. FastAPI Shutdown Event Handler

@app.on_event("shutdown")
async def shutdown_event():
    """Gracefully shut down the application."""
    global inference_backend
    logger.info("🛑 Starting graceful shutdown...")

    try:
        if inference_backend:
            logger.info(f"🧹 Cleaning up {inference_backend.backend_type} backend...")
            inference_backend.cleanup()
            logger.info("✅ Backend cleanup completed")

        # Additional cleanup for global variables
        cleanup_model_memory()
        logger.info("✅ Global memory cleanup completed")

        logger.info("✅ Graceful shutdown completed successfully")

    except Exception as e:
        logger.error(f"❌ Error during shutdown: {e}")
        # Don't raise the exception to avoid preventing shutdown

Key Features:

  • Calls backend-specific cleanup methods
  • Clears GPU memory and runs garbage collection
  • Handles errors gracefully without blocking shutdown
  • Uses FastAPI's native shutdown event (no signal handlers)
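
The cleanup_model_memory() helper called by the handler is not reproduced in this document. A minimal sketch of what it might look like (hypothetical: it assumes module-level model and tokenizer globals and mirrors the backend cleanup shown in the next section):

def cleanup_model_memory() -> None:
    """Hypothetical sketch: drop module-level model references and free GPU memory."""
    global model, tokenizer  # assumed globals; adjust to the app's actual state
    model = None
    tokenizer = None

    # Release cached GPU allocations, if CUDA is present
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # Force garbage collection so freed objects are reclaimed immediately
    import gc
    gc.collect()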

2. vLLM Backend Cleanup

def cleanup(self) -> None:
    """Clean up vLLM resources gracefully."""
    try:
        if self.engine:
            logger.info("🧹 Shutting down vLLM engine...")
            del self.engine
            self.engine = None
            logger.info("✅ vLLM engine reference cleared")

        # Clear CUDA cache
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            logger.info("✅ CUDA cache cleared")

        # Force garbage collection
        import gc
        gc.collect()
        logger.info("✅ Garbage collection completed")

    except Exception as e:
        logger.error(f"❌ Error during vLLM cleanup: {e}")

Key Features:

  • Properly deletes vLLM engine references
  • Clears CUDA cache to free GPU memory
  • Forces garbage collection
  • Detailed logging for debugging

3. vLLM Sleep Mode Support

def sleep(self) -> bool:
    """Put vLLM engine into sleep mode (for HuggingFace Spaces)."""
    try:
        if self.engine and hasattr(self.engine, 'sleep'):
            logger.info("😴 Putting vLLM engine to sleep...")
            self.engine.sleep()
            logger.info("✅ vLLM engine is now sleeping (GPU memory released)")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support sleep mode or isn't loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error putting vLLM to sleep (non-critical): {e}")
        return False

def wake(self) -> bool:
    """Wake up vLLM engine from sleep mode."""
    try:
        # vLLM names the resume call `wake_up`; check both spellings so a
        # wrapper that exposes `wake` still works.
        wake_fn = getattr(self.engine, 'wake_up', None) or getattr(self.engine, 'wake', None)
        if wake_fn:
            logger.info("🌅 Waking up vLLM engine...")
            wake_fn()
            logger.info("✅ vLLM engine is now awake")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support wake-up or isn't loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error waking up vLLM (non-critical): {e}")
        return False

Key Features:

  • Uses vLLM's native sleep mode API when available (recent vLLM versions require the engine to be created with enable_sleep_mode=True; see the sketch after this list)
  • Releases GPU memory while keeping model weights in CPU RAM
  • Much faster wake-up than a full model reload
  • Graceful fallback if sleep mode is not supported
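
For reference, vLLM's native sleep mode looks roughly like this in recent releases; the exact signatures may vary by version, and the model name below is purely illustrative:

from vllm import LLM

# Sleep mode must be enabled when the engine is created
llm = LLM(model="your-model", enable_sleep_mode=True)  # model name is illustrative

llm.sleep(level=1)  # offload weights to CPU RAM and discard the KV cache, freeing GPU memory
llm.wake_up()       # restore weights to the GPU, much faster than a cold start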

4. Manual Control Endpoints

Sleep Endpoint

POST /sleep

Puts the backend into sleep mode, releasing GPU memory.

Response:

{
  "message": "Backend put to sleep successfully",
  "status": "sleeping",
  "backend": "vllm",
  "note": "GPU memory released, ready for HuggingFace Space sleep"
}

Wake Endpoint

POST /wake

Wakes up the backend from sleep mode.

Response:

{
  "message": "Backend woken up successfully",
  "status": "awake",
  "backend": "vllm",
  "note": "Ready for inference"
}
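
The handler code for these endpoints is not shown above; a minimal sketch of what they might look like (assuming the global inference_backend and FastAPI app from section 1; function names and status codes are illustrative):

from fastapi import HTTPException

@app.post("/sleep")
async def sleep_endpoint():
    """Hypothetical sketch: release GPU memory by putting the backend to sleep."""
    if inference_backend is None:
        raise HTTPException(status_code=503, detail="Backend not initialized")
    if not inference_backend.sleep():
        raise HTTPException(status_code=400, detail="Backend does not support sleep mode")
    return {
        "message": "Backend put to sleep successfully",
        "status": "sleeping",
        "backend": inference_backend.backend_type,
        "note": "GPU memory released, ready for HuggingFace Space sleep",
    }

@app.post("/wake")
async def wake_endpoint():
    """Hypothetical sketch: restore the backend from sleep mode."""
    if inference_backend is None:
        raise HTTPException(status_code=503, detail="Backend not initialized")
    if not inference_backend.wake():
        raise HTTPException(status_code=400, detail="Backend does not support wake-up")
    return {
        "message": "Backend woken up successfully",
        "status": "awake",
        "backend": inference_backend.backend_type,
        "note": "Ready for inference",
    }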

5. Startup Wake-Up Check

# Runs inside the FastAPI startup event handler
if inference_backend.backend_type == "vllm":
    logger.info("🌅 Checking if vLLM needs to wake up from sleep...")
    try:
        wake_success = inference_backend.wake()
        if wake_success:
            logger.info("✅ vLLM wake-up successful")
        else:
            logger.info("ℹ️ vLLM wake-up not needed (fresh startup)")
    except Exception as e:
        logger.info(f"ℹ️ vLLM wake-up check completed (normal on fresh startup): {e}")

Key Features:

  • Automatically checks if vLLM needs to wake up on startup
  • Handles both fresh starts and wake-ups from sleep
  • Non-blocking - continues startup even if wake fails

🚀 How It Works with HuggingFace Spaces

Scenario 1: Space Going to Sleep

  1. HuggingFace Spaces sends a shutdown signal
  2. FastAPI's shutdown event handler is triggered
  3. inference_backend.cleanup() is called
  4. vLLM engine is properly shut down
  5. GPU memory is cleared
  6. Space can sleep without errors

Scenario 2: Space Waking Up

  1. User accesses the Space
  2. FastAPI starts up normally
  3. Startup event calls inference_backend.wake()
  4. vLLM restores model to GPU (if applicable)
  5. Ready for inference

Scenario 3: Manual Sleep/Wake

  1. Call POST /sleep to manually put backend to sleep
  2. GPU memory is released
  3. Call POST /wake to restore backend
  4. Resume inference

📊 Expected Behavior

Before Implementation

ERROR 10-04 10:17:40 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.

After Implementation

INFO:app:🛑 Starting graceful shutdown...
INFO:app:🧹 Cleaning up vllm backend...
INFO:app:✅ vLLM engine reference cleared
INFO:app:✅ CUDA cache cleared
INFO:app:✅ Garbage collection completed
INFO:app:✅ Backend cleanup completed
INFO:app:✅ Global memory cleanup completed
INFO:app:✅ Graceful shutdown completed successfully

🔧 Design Decisions

Why No Signal Handlers?

Initially implemented custom signal handlers (SIGTERM, SIGINT), but removed them because:

  1. HuggingFace Infrastructure: HuggingFace Spaces has its own signal handling infrastructure
  2. Conflicts: Custom signal handlers can conflict with the platform's shutdown process
  3. FastAPI Native: FastAPI's @app.on_event("shutdown") is already properly integrated
  4. Simplicity: Fewer moving parts = more reliable

Why Separate Sleep/Wake from Shutdown?

  1. Different Use Cases: Sleep is for temporary pause, shutdown is for termination
  2. Performance: Sleep mode is faster to resume than full restart
  3. Flexibility: Manual control allows testing and optimization
  4. Non-Intrusive: Sleep/wake are optional features that don't affect core functionality

πŸ› Issues Fixed

Issue 1: Undefined Variable

Error: NameError: name 'deployment_env' is not defined
Fix: Removed the environment check around the wake-up call; the call is safe for all backends

Issue 2: Signal Handler Conflicts

Error: Runtime errors on Space startup
Fix: Removed the custom signal handlers and rely on FastAPI's native events

Issue 3: Logger Initialization Order

Error: Logger used before it was defined
Fix: Moved the signal import after the logger setup (this fix predates the removal of the signal handlers in Issue 2)
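
As an illustration of the ordering fix (a minimal sketch, not the app's actual module layout):

import logging

# Configure logging first, so anything imported or registered below can use it
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

# Imports and registrations that may log at import time come after the logger
# exists (the original signal-handler setup lived here before it was removed)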

📈 Benefits

  1. No More Unexpected Deaths: vLLM engine shuts down cleanly
  2. Faster Wake-Up: Sleep mode keeps the model in CPU RAM
  3. Better Resource Management: Proper GPU memory cleanup
  4. Manual Control: API endpoints for testing and debugging
  5. Production Ready: Errors during shutdown, sleep, and wake are logged rather than raised, so they never block the process

🧪 Testing

Test Graceful Shutdown

# Check health before shutdown
curl https://your-api-url.hf.space/health

# Wait for Space to go to sleep (or manually stop it)
# Check logs for graceful shutdown messages

Test Sleep/Wake

# Put to sleep
curl -X POST https://your-api-url.hf.space/sleep

# Check backend status
curl https://your-api-url.hf.space/backend

# Wake up
curl -X POST https://your-api-url.hf.space/wake

# Test inference
curl -X POST https://your-api-url.hf.space/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is financial risk?", "max_new_tokens": 50}'

πŸ“ Future Improvements

  1. Automatic Sleep: Auto-sleep after X minutes of inactivity (see the sketch after this list)
  2. Sleep Metrics: Track sleep/wake cycles and performance
  3. Progressive Wake: Warm up model gradually
  4. Health Check Integration: Report sleep status in health endpoint
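
For the first item, a minimal sketch of an inactivity watchdog (hypothetical: it assumes request handlers refresh a module-level last_request_time and reuses the global inference_backend):

import asyncio
import time

IDLE_SLEEP_SECONDS = 15 * 60   # hypothetical threshold: sleep after 15 idle minutes
last_request_time = time.monotonic()

async def auto_sleep_watchdog() -> None:
    """Hypothetical background task: put the backend to sleep after inactivity."""
    sleeping = False
    while True:
        await asyncio.sleep(60)  # poll once a minute
        idle = time.monotonic() - last_request_time
        if idle < IDLE_SLEEP_SECONDS:
            sleeping = False  # recent traffic; the wake path runs on demand
        elif not sleeping and inference_backend:
            sleeping = inference_backend.sleep()

# Launched from the startup event, e.g.:
#   asyncio.create_task(auto_sleep_watchdog())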

✅ Status

  • ✅ FastAPI shutdown event handler
  • ✅ vLLM cleanup method with logging
  • ✅ vLLM sleep/wake methods
  • ✅ Manual sleep/wake API endpoints
  • ✅ Startup wake-up check
  • ✅ Remove signal handlers (simplification)
  • ✅ Fix undefined variable bug
  • ✅ Deploy to HuggingFace Space
  • ⏳ Test on live Space
  • ⏳ Monitor for 24 hours
  • ⏳ Document in main README

🔗 Related Files

  • app.py: Main application with shutdown/sleep implementation
  • PROJECT_RULES.md: Updated with vLLM configuration
  • docs/VLLM_INTEGRATION.md: vLLM backend documentation
  • README.md: Project overview and architecture
