
Graceful Shutdown & Sleep Mode Implementation

Version: 24.1.1
Date: October 4, 2025
Status: ✅ Deployed to HuggingFace L40 Space

🎯 Overview

Implemented graceful shutdown and vLLM sleep mode support so the API survives HuggingFace Spaces sleep/wake cycles without hitting the "EngineCore_DP0 died unexpectedly" error.

πŸ› οΈ Implementation Details

1. FastAPI Shutdown Event Handler

@app.on_event("shutdown")
async def shutdown_event():
    """Gracefully shut down the application."""
    global inference_backend
    logger.info("🛑 Starting graceful shutdown...")

    try:
        if inference_backend:
            logger.info(f"🧹 Cleaning up {inference_backend.backend_type} backend...")
            inference_backend.cleanup()
            logger.info("✅ Backend cleanup completed")

        # Additional cleanup for global variables
        cleanup_model_memory()
        logger.info("✅ Global memory cleanup completed")

        logger.info("✅ Graceful shutdown completed successfully")

    except Exception as e:
        logger.error(f"❌ Error during shutdown: {e}")
        # Don't raise the exception to avoid preventing shutdown

Key Features:

  • Calls backend-specific cleanup methods
  • Clears GPU memory and runs garbage collection
  • Handles errors gracefully without blocking shutdown
  • Uses FastAPI's native shutdown event (no signal handlers)
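
The cleanup_model_memory() helper called by the handler is not reproduced in this document. A minimal sketch of what it might look like (hypothetical: it assumes module-level model and tokenizer globals and mirrors the backend cleanup shown in the next section):

def cleanup_model_memory() -> None:
    """Hypothetical sketch: drop module-level model references and free GPU memory."""
    global model, tokenizer  # assumed globals; adjust to the app's actual state
    model = None
    tokenizer = None

    # Release cached GPU allocations, if CUDA is present
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # Force garbage collection so freed objects are reclaimed immediately
    import gc
    gc.collect()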

2. vLLM Backend Cleanup

def cleanup(self) -> None:
    """Clean up vLLM resources gracefully."""
    try:
        if self.engine:
            logger.info("🧹 Shutting down vLLM engine...")
            del self.engine
            self.engine = None
            logger.info("✅ vLLM engine reference cleared")

        # Clear CUDA cache
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            logger.info("✅ CUDA cache cleared")

        # Force garbage collection
        import gc
        gc.collect()
        logger.info("✅ Garbage collection completed")

    except Exception as e:
        logger.error(f"❌ Error during vLLM cleanup: {e}")

Key Features:

  • Properly deletes vLLM engine references
  • Clears CUDA cache to free GPU memory
  • Forces garbage collection
  • Detailed logging for debugging

3. vLLM Sleep Mode Support

def sleep(self) -> bool:
    """Put vLLM engine into sleep mode (for HuggingFace Spaces)."""
    try:
        if self.engine and hasattr(self.engine, 'sleep'):
            logger.info("😴 Putting vLLM engine to sleep...")
            self.engine.sleep()
            logger.info("✅ vLLM engine is now sleeping (GPU memory released)")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support sleep mode or isn't loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error putting vLLM to sleep (non-critical): {e}")
        return False

def wake(self) -> bool:
    """Wake up vLLM engine from sleep mode."""
    try:
        # vLLM names the resume call `wake_up`; check both spellings so a
        # wrapper that exposes `wake` still works.
        wake_fn = getattr(self.engine, 'wake_up', None) or getattr(self.engine, 'wake', None)
        if wake_fn:
            logger.info("🌅 Waking up vLLM engine...")
            wake_fn()
            logger.info("✅ vLLM engine is now awake")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support wake-up or isn't loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error waking up vLLM (non-critical): {e}")
        return False

Key Features:

  • Uses vLLM's native sleep mode API when available (recent vLLM versions require the engine to be created with enable_sleep_mode=True; see the sketch after this list)
  • Releases GPU memory while keeping model weights in CPU RAM
  • Much faster wake-up than a full model reload
  • Graceful fallback if sleep mode is not supported
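
For reference, vLLM's native sleep mode looks roughly like this in recent releases; the exact signatures may vary by version, and the model name below is purely illustrative:

from vllm import LLM

# Sleep mode must be enabled when the engine is created
llm = LLM(model="your-model", enable_sleep_mode=True)  # model name is illustrative

llm.sleep(level=1)  # offload weights to CPU RAM and discard the KV cache, freeing GPU memory
llm.wake_up()       # restore weights to the GPU, much faster than a cold start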

4. Manual Control Endpoints

Sleep Endpoint

POST /sleep

Puts the backend into sleep mode, releasing GPU memory.

Response:

{
  "message": "Backend put to sleep successfully",
  "status": "sleeping",
  "backend": "vllm",
  "note": "GPU memory released, ready for HuggingFace Space sleep"
}

Wake Endpoint

POST /wake

Wakes up the backend from sleep mode.

Response:

{
  "message": "Backend woken up successfully",
  "status": "awake",
  "backend": "vllm",
  "note": "Ready for inference"
}
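
The handler code for these endpoints is not shown above; a minimal sketch of what they might look like (assuming the global inference_backend and FastAPI app from section 1; function names and status codes are illustrative):

from fastapi import HTTPException

@app.post("/sleep")
async def sleep_endpoint():
    """Hypothetical sketch: release GPU memory by putting the backend to sleep."""
    if inference_backend is None:
        raise HTTPException(status_code=503, detail="Backend not initialized")
    if not inference_backend.sleep():
        raise HTTPException(status_code=400, detail="Backend does not support sleep mode")
    return {
        "message": "Backend put to sleep successfully",
        "status": "sleeping",
        "backend": inference_backend.backend_type,
        "note": "GPU memory released, ready for HuggingFace Space sleep",
    }

@app.post("/wake")
async def wake_endpoint():
    """Hypothetical sketch: restore the backend from sleep mode."""
    if inference_backend is None:
        raise HTTPException(status_code=503, detail="Backend not initialized")
    if not inference_backend.wake():
        raise HTTPException(status_code=400, detail="Backend does not support wake-up")
    return {
        "message": "Backend woken up successfully",
        "status": "awake",
        "backend": inference_backend.backend_type,
        "note": "Ready for inference",
    }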

5. Startup Wake-Up Check

# Runs inside the FastAPI startup event handler
if inference_backend.backend_type == "vllm":
    logger.info("🌅 Checking if vLLM needs to wake up from sleep...")
    try:
        wake_success = inference_backend.wake()
        if wake_success:
            logger.info("✅ vLLM wake-up successful")
        else:
            logger.info("ℹ️ vLLM wake-up not needed (fresh startup)")
    except Exception as e:
        logger.info(f"ℹ️ vLLM wake-up check completed (normal on fresh startup): {e}")

Key Features:

  • Automatically checks if vLLM needs to wake up on startup
  • Handles both fresh starts and wake-ups from sleep
  • Non-blocking - continues startup even if wake fails

🚀 How It Works with HuggingFace Spaces

Scenario 1: Space Going to Sleep

  1. HuggingFace Spaces sends a shutdown signal
  2. FastAPI's shutdown event handler is triggered
  3. inference_backend.cleanup() is called
  4. vLLM engine is properly shut down
  5. GPU memory is cleared
  6. Space can sleep without errors

Scenario 2: Space Waking Up

  1. User accesses the Space
  2. FastAPI starts up normally
  3. Startup event calls inference_backend.wake()
  4. vLLM restores model to GPU (if applicable)
  5. Ready for inference

Scenario 3: Manual Sleep/Wake

  1. Call POST /sleep to manually put backend to sleep
  2. GPU memory is released
  3. Call POST /wake to restore backend
  4. Resume inference

📊 Expected Behavior

Before Implementation

ERROR 10-04 10:17:40 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.

After Implementation

INFO:app:🛑 Starting graceful shutdown...
INFO:app:🧹 Cleaning up vllm backend...
INFO:app:✅ vLLM engine reference cleared
INFO:app:✅ CUDA cache cleared
INFO:app:✅ Garbage collection completed
INFO:app:✅ Backend cleanup completed
INFO:app:✅ Global memory cleanup completed
INFO:app:✅ Graceful shutdown completed successfully

🔧 Design Decisions

Why No Signal Handlers?

Initially implemented custom signal handlers (SIGTERM, SIGINT), but removed them because:

  1. HuggingFace Infrastructure: HuggingFace Spaces has its own signal handling infrastructure
  2. Conflicts: Custom signal handlers can conflict with the platform's shutdown process
  3. FastAPI Native: FastAPI's @app.on_event("shutdown") is already properly integrated
  4. Simplicity: Fewer moving parts = more reliable

Why Separate Sleep/Wake from Shutdown?

  1. Different Use Cases: Sleep is for temporary pause, shutdown is for termination
  2. Performance: Sleep mode is faster to resume than full restart
  3. Flexibility: Manual control allows testing and optimization
  4. Non-Intrusive: Sleep/wake are optional features that don't affect core functionality

πŸ› Issues Fixed

Issue 1: Undefined Variable

Error: NameError: name 'deployment_env' is not defined
Fix: Removed the environment check around the wake-up call; the call is safe for all backends

Issue 2: Signal Handler Conflicts

Error: Runtime errors on Space startup
Fix: Removed the custom signal handlers and rely on FastAPI's native events

Issue 3: Logger Initialization Order

Error: Logger used before it was defined
Fix: Moved the signal import after the logger setup (this fix predates the removal of the signal handlers in Issue 2)
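
As an illustration of the ordering fix (a minimal sketch, not the app's actual module layout):

import logging

# Configure logging first, so anything imported or registered below can use it
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

# Imports and registrations that may log at import time come after the logger
# exists (the original signal-handler setup lived here before it was removed)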

📈 Benefits

  1. No More Unexpected Deaths: vLLM engine shuts down cleanly
  2. Faster Wake-Up: Sleep mode keeps the model in CPU RAM
  3. Better Resource Management: Proper GPU memory cleanup
  4. Manual Control: API endpoints for testing and debugging
  5. Production Ready: Errors during shutdown, sleep, and wake are logged rather than raised, so they never block the process

🧪 Testing

Test Graceful Shutdown

# Check health before shutdown
curl https://your-api-url.hf.space/health

# Wait for Space to go to sleep (or manually stop it)
# Check logs for graceful shutdown messages

Test Sleep/Wake

# Put to sleep
curl -X POST https://your-api-url.hf.space/sleep

# Check backend status
curl https://your-api-url.hf.space/backend

# Wake up
curl -X POST https://your-api-url.hf.space/wake

# Test inference
curl -X POST https://your-api-url.hf.space/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is financial risk?", "max_new_tokens": 50}'

πŸ“ Future Improvements

  1. Automatic Sleep: Auto-sleep after X minutes of inactivity (see the sketch after this list)
  2. Sleep Metrics: Track sleep/wake cycles and performance
  3. Progressive Wake: Warm up model gradually
  4. Health Check Integration: Report sleep status in health endpoint
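
For the first item, a minimal sketch of an inactivity watchdog (hypothetical: it assumes request handlers refresh a module-level last_request_time and reuses the global inference_backend):

import asyncio
import time

IDLE_SLEEP_SECONDS = 15 * 60   # hypothetical threshold: sleep after 15 idle minutes
last_request_time = time.monotonic()

async def auto_sleep_watchdog() -> None:
    """Hypothetical background task: put the backend to sleep after inactivity."""
    sleeping = False
    while True:
        await asyncio.sleep(60)  # poll once a minute
        idle = time.monotonic() - last_request_time
        if idle < IDLE_SLEEP_SECONDS:
            sleeping = False  # recent traffic; the wake path runs on demand
        elif not sleeping and inference_backend:
            sleeping = inference_backend.sleep()

# Launched from the startup event, e.g.:
#   asyncio.create_task(auto_sleep_watchdog())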

✅ Status

  • ✅ FastAPI shutdown event handler
  • ✅ vLLM cleanup method with logging
  • ✅ vLLM sleep/wake methods
  • ✅ Manual sleep/wake API endpoints
  • ✅ Startup wake-up check
  • ✅ Remove signal handlers (simplification)
  • ✅ Fix undefined variable bug
  • ✅ Deploy to HuggingFace Space
  • ⏳ Test on live Space
  • ⏳ Monitor for 24 hours
  • ⏳ Document in main README

🔗 Related Files

  • app.py: Main application with shutdown/sleep implementation
  • PROJECT_RULES.md: Updated with vLLM configuration
  • docs/VLLM_INTEGRATION.md: vLLM backend documentation
  • README.md: Project overview and architecture
