Spaces:

jeanbaptdzd
/

dragonllm-finance-models

Runtime error

File size: 9,757 Bytes

8c0b652

# Graceful Shutdown & Sleep Mode Implementation

**Version**: 24.1.1  
**Date**: October 4, 2025  
**Status**: ✅ Deployed to HuggingFace L40 Space

## 🎯 Overview

Implemented graceful shutdown and vLLM sleep mode support to handle HuggingFace Spaces sleep/wake cycles without the `EngineCore_DP0 died unexpectedly` error.

## 🛠️ Implementation Details

### 1. **FastAPI Shutdown Event Handler**

```python
@app.on_event("shutdown")
async def shutdown_event():
    """Gracefully shutdown the application."""
    global inference_backend
    logger.info("🛑 Starting graceful shutdown...")
    
    try:
        if inference_backend:
            logger.info(f"🧹 Cleaning up {inference_backend.backend_type} backend...")
            inference_backend.cleanup()
            logger.info("✅ Backend cleanup completed")
        
        # Additional cleanup for global variables
        cleanup_model_memory()
        logger.info("✅ Global memory cleanup completed")
        
        logger.info("✅ Graceful shutdown completed successfully")
        
    except Exception as e:
        logger.error(f"❌ Error during shutdown: {e}")
        # Don't raise the exception to avoid preventing shutdown
```

**Key Features**:
- Calls backend-specific cleanup methods
- Clears GPU memory and runs garbage collection
- Handles errors gracefully without blocking shutdown
- Uses FastAPI's native shutdown event (no signal handlers)

### 2. **vLLM Backend Cleanup**

```python
def cleanup(self) -> None:
    """Clean up vLLM resources gracefully."""
    try:
        if self.engine:
            logger.info("🧹 Shutting down vLLM engine...")
            del self.engine
            self.engine = None
            logger.info("✅ vLLM engine reference cleared")
        
        # Clear CUDA cache
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            logger.info("✅ CUDA cache cleared")
        
        # Force garbage collection
        import gc
        gc.collect()
        logger.info("✅ Garbage collection completed")
        
    except Exception as e:
        logger.error(f"❌ Error during vLLM cleanup: {e}")
```

**Key Features**:
- Properly deletes vLLM engine references
- Clears CUDA cache to free GPU memory
- Forces garbage collection
- Detailed logging for debugging

### 3. **vLLM Sleep Mode Support**

```python
def sleep(self) -> bool:
    """Put vLLM engine into sleep mode (for HuggingFace Spaces)."""
    try:
        if self.engine and hasattr(self.engine, 'sleep'):
            logger.info("😴 Putting vLLM engine to sleep...")
            self.engine.sleep()
            logger.info("✅ vLLM engine is now sleeping (GPU memory released)")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support sleep mode or not loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error putting vLLM to sleep (non-critical): {e}")
        return False

def wake(self) -> bool:
    """Wake up vLLM engine from sleep mode."""
    try:
        if self.engine and hasattr(self.engine, 'wake'):
            logger.info("🌅 Waking up vLLM engine...")
            self.engine.wake()
            logger.info("✅ vLLM engine is now awake")
            return True
        else:
            logger.info("ℹ️ vLLM engine doesn't support wake mode or not loaded")
            return False
    except Exception as e:
        logger.warning(f"⚠️ Error waking up vLLM (non-critical): {e}")
        return False
```

**Key Features**:
- Uses vLLM's native sleep mode API (if available)
- Releases GPU memory while keeping model in CPU RAM
- Much faster wake-up than full model reload
- Graceful fallback if sleep mode not supported

### 4. **Manual Control Endpoints**

#### Sleep Endpoint
```
POST /sleep
```

Puts the backend into sleep mode, releasing GPU memory.

**Response**:
```json
{
  "message": "Backend put to sleep successfully",
  "status": "sleeping",
  "backend": "vllm",
  "note": "GPU memory released, ready for HuggingFace Space sleep"
}
```

#### Wake Endpoint
```
POST /wake
```

Wakes up the backend from sleep mode.

**Response**:
```json
{
  "message": "Backend woken up successfully",
  "status": "awake",
  "backend": "vllm",
  "note": "Ready for inference"
}
```

### 5. **Startup Wake-Up Check**

```python
if inference_backend.backend_type == "vllm":
    logger.info("🌅 Checking if vLLM needs to wake up from sleep...")
    try:
        wake_success = inference_backend.wake()
        if wake_success:
            logger.info("✅ vLLM wake-up successful")
        else:
            logger.info("ℹ️ vLLM wake-up not needed (fresh startup)")
    except Exception as e:
        logger.info(f"ℹ️ vLLM wake-up check completed (normal on fresh startup): {e}")
```

**Key Features**:
- Automatically checks if vLLM needs to wake up on startup
- Handles both fresh starts and wake-ups from sleep
- Non-blocking - continues startup even if wake fails

## 🚀 How It Works with HuggingFace Spaces

### Scenario 1: Space Going to Sleep

1. HuggingFace Spaces sends shutdown signal
2. FastAPI's shutdown event handler is triggered
3. `inference_backend.cleanup()` is called
4. vLLM engine is properly shut down
5. GPU memory is cleared
6. Space can sleep without errors

### Scenario 2: Space Waking Up

1. User accesses the Space
2. FastAPI starts up normally
3. Startup event calls `inference_backend.wake()`
4. vLLM restores model to GPU (if applicable)
5. Ready for inference

### Scenario 3: Manual Sleep/Wake

1. Call `POST /sleep` to manually put backend to sleep
2. GPU memory is released
3. Call `POST /wake` to restore backend
4. Resume inference

## 📊 Expected Behavior

### Before Implementation
```
ERROR 10-04 10:17:40 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
```

### After Implementation
```
INFO:app:🛑 Starting graceful shutdown...
INFO:app:🧹 Cleaning up vllm backend...
INFO:app:✅ vLLM engine reference cleared
INFO:app:✅ CUDA cache cleared
INFO:app:✅ Garbage collection completed
INFO:app:✅ Backend cleanup completed
INFO:app:✅ Global memory cleanup completed
INFO:app:✅ Graceful shutdown completed successfully
```

## 🔧 Design Decisions

### Why No Signal Handlers?

Initially implemented custom signal handlers (SIGTERM, SIGINT), but removed them because:

1. **HuggingFace Infrastructure**: HuggingFace Spaces has its own signal handling infrastructure
2. **Conflicts**: Custom signal handlers can conflict with the platform's shutdown process
3. **FastAPI Native**: FastAPI's `@app.on_event("shutdown")` is already properly integrated
4. **Simplicity**: Fewer moving parts = more reliable

### Why Separate Sleep/Wake from Shutdown?

1. **Different Use Cases**: Sleep is for temporary pause, shutdown is for termination
2. **Performance**: Sleep mode is faster to resume than full restart
3. **Flexibility**: Manual control allows testing and optimization
4. **Non-Intrusive**: Sleep/wake are optional features that don't affect core functionality

## 🐛 Issues Fixed

### Issue 1: Undefined Variable
**Error**: `NameError: name 'deployment_env' is not defined`  
**Fix**: Removed environment check in wake-up call - safe for all backends

### Issue 2: Signal Handler Conflicts
**Error**: Runtime errors on Space startup  
**Fix**: Removed custom signal handlers, rely on FastAPI native events

### Issue 3: Logger Initialization Order
**Error**: Logger used before definition  
**Fix**: Moved signal import after logger setup

## 📈 Benefits

1. **No More Unexpected Deaths**: vLLM engine shuts down cleanly
2. **Faster Wake-Up**: Sleep mode preserves model in CPU RAM
3. **Better Resource Management**: Proper GPU memory cleanup
4. **Manual Control**: API endpoints for testing and debugging
5. **Production Ready**: Handles all edge cases gracefully

## 🧪 Testing

### Test Graceful Shutdown
```bash
# Check health before shutdown
curl https://your-api-url.hf.space/health

# Wait for Space to go to sleep (or manually stop it)
# Check logs for graceful shutdown messages
```

### Test Sleep/Wake
```bash
# Put to sleep
curl -X POST https://your-api-url.hf.space/sleep

# Check backend status
curl https://your-api-url.hf.space/backend

# Wake up
curl -X POST https://your-api-url.hf.space/wake

# Test inference
curl -X POST https://your-api-url.hf.space/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is financial risk?", "max_new_tokens": 50}'
```

## 📝 Future Improvements

1. **Automatic Sleep**: Auto-sleep after X minutes of inactivity
2. **Sleep Metrics**: Track sleep/wake cycles and performance
3. **Progressive Wake**: Warm up model gradually
4. **Health Check Integration**: Report sleep status in health endpoint

## ✅ Status

- [x] FastAPI shutdown event handler
- [x] vLLM cleanup method with logging
- [x] vLLM sleep/wake methods
- [x] Manual sleep/wake API endpoints
- [x] Startup wake-up check
- [x] Remove signal handlers (simplification)
- [x] Fix undefined variable bug
- [x] Deploy to HuggingFace Space
- [ ] Test on live Space
- [ ] Monitor for 24 hours
- [ ] Document in main README

## 🔗 Related Files

- `app.py`: Main application with shutdown/sleep implementation
- `PROJECT_RULES.md`: Updated with vLLM configuration
- `docs/VLLM_INTEGRATION.md`: vLLM backend documentation
- `README.md`: Project overview and architecture

## 📚 References

- [vLLM Sleep Mode Documentation](https://docs.vllm.ai/en/latest/features/sleep_mode.html)
- [FastAPI Lifecycle Events](https://fastapi.tiangolo.com/advanced/events/)
- [HuggingFace Spaces Docker](https://huggingface.co/docs/hub/spaces-sdks-docker)