
HuggingFace Model Caching - Best Practices & Analysis

Current Situation Analysis

What We've Been Doing

We've been setting HF_HOME=/data/.huggingface to store models in persistent storage. This is the correct approach, but we ran into disk space issues.

The Problem

The persistent storage (20GB) filled up completely (0.07 MB free) due to:

  1. Failed download attempts leaving partial files
  2. No automatic cleanup of incomplete downloads
  3. Multiple revisions being cached unnecessarily
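
For reference, a quick way to reproduce this diagnosis from Python (assuming persistent storage is mounted at /data, as on our Space):

import shutil

# How much space is left on the persistent volume?
total, used, free = shutil.disk_usage("/data")
print(f"free: {free / 1024**2:.2f} MB of {total / 1024**3:.0f} GB total")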

How HuggingFace Caching Actually Works

Cache Directory Structure

~/.cache/huggingface/hub/  (or $HF_HOME/hub/)
├── models--LinguaCustodia--llama3.1-8b-fin-v0.3/
│   ├── refs/
│   │   └── main           # Points to current commit hash
│   ├── blobs/             # Actual model files (named by hash)
│   │   ├── 403450e234...  # Model weights
│   │   ├── 7cb18dc9ba...  # Config file
│   │   └── d7edf6bd2a...  # Tokenizer file
│   └── snapshots/         # Symlinks to blobs for each revision
│       ├── aaaaaa.../     # First revision
│       │   ├── config.json -> ../../blobs/7cb18...
│       │   └── pytorch_model.bin -> ../../blobs/403450...
│       └── bbbbbb.../     # Second revision (shares unchanged files)
│           ├── config.json -> ../../blobs/7cb18... (same blob!)
│           └── pytorch_model.bin -> ../../blobs/NEW_HASH...

Key Insights

  1. Symlink-Based Deduplication

    • HuggingFace uses symlinks to avoid storing duplicate files
    • If a file doesn't change between revisions, it's only stored once
    • The blobs/ directory contains actual data
    • The snapshots/ directory contains symlinks organized by revision
  2. Cache is Smart

    • Models are downloaded ONCE and reused
    • Each file is identified by its hash
    • Multiple revisions share common files
    • No re-download unless files actually change
  3. Why We're Not Seeing Re-downloads

    • We ARE using the cache correctly!
    • Setting HF_HOME=/data/.huggingface is the right approach
    • The issue was disk space, not cache configuration
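
To see the deduplication for yourself, here is a small sketch (ours, assuming the layout above and top-level snapshot files only) that resolves each snapshot symlink to its backing blob:

import os

# Identical files across revisions resolve to the same blob: that is the deduplication
repo_dir = os.path.expanduser(
    "~/.cache/huggingface/hub/models--LinguaCustodia--llama3.1-8b-fin-v0.3"
)
snapshots = os.path.join(repo_dir, "snapshots")
for revision in sorted(os.listdir(snapshots)):
    rev_dir = os.path.join(snapshots, revision)
    for name in sorted(os.listdir(rev_dir)):
        path = os.path.join(rev_dir, name)
        if os.path.islink(path):
            blob = os.path.basename(os.path.realpath(path))
            print(f"{revision[:8]}/{name} -> blobs/{blob[:10]}...")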

What We Should Be Doing

✅ Correct Practices (What We're Already Doing)

  1. Setting HF_HOME

    os.environ["HF_HOME"] = "/data/.huggingface"
    

    This is the official way to configure persistent caching.
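
    One caveat worth noting: huggingface_hub resolves its cache paths when it is first imported, so HF_HOME must be set before importing transformers or huggingface_hub. A minimal ordering sketch:

    import os

    # Set HF_HOME *before* any HF import: cache paths are resolved at import time
    os.environ["HF_HOME"] = "/data/.huggingface"

    from transformers import pipeline  # imported after the env var is set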

  2. Using from_pretrained() and pipeline()

    from transformers import pipeline
    import torch

    pipe = pipeline(
        "text-generation",
        model=model_name,          # e.g. "LinguaCustodia/llama3.1-8b-fin-v0.3"
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        token=hf_token_lc
    )
    

    These methods automatically use the cache - no additional configuration needed!

  3. No force_download: we're correctly NOT using force_download=True, which would bypass the cache.

🔧 What We Need to Fix

  1. Disk Space Management

    • Monitor available space before downloads
    • Clean up failed/incomplete downloads
    • Set proper fallback to ephemeral cache
  2. Handle Incomplete Downloads

    • HuggingFace may leave .incomplete and .lock files behind
    • These should be cleaned up periodically (a cleanup sketch follows this list)
  3. Monitor Cache Size

    • Use huggingface-cli scan-cache to understand disk usage
    • Remove old revisions if needed
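
A minimal cleanup sketch for item 2, assuming the standard hub/ layout under HF_HOME and that no download is running at the time (the helper name is ours):

import logging
import os

logger = logging.getLogger(__name__)

def clean_stale_downloads(hf_home: str) -> int:
    """Remove leftover *.incomplete and *.lock files from the HF cache."""
    removed = 0
    hub_dir = os.path.join(hf_home, "hub")
    for root, _dirs, files in os.walk(hub_dir):
        for name in files:
            if name.endswith((".incomplete", ".lock")):
                path = os.path.join(root, name)
                try:
                    os.remove(path)
                    removed += 1
                except OSError as exc:
                    logger.warning("Could not remove %s: %s", path, exc)
    return removed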

Optimal Configuration for HuggingFace Spaces

For Persistent Storage (20GB+)

def setup_storage():
    """Optimal setup for HuggingFace Spaces with persistent storage."""
    import logging
    import os
    import shutil

    logger = logging.getLogger(__name__)

    # 1. Check if HF_HOME is set by Space variables (highest priority)
    if "HF_HOME" in os.environ:
        hf_home = os.environ["HF_HOME"]
        logger.info(f"✅ Using HF_HOME from Space: {hf_home}")
    elif os.path.exists("/data"):
        # 2. Auto-detect persistent storage
        hf_home = "/data/.huggingface"
        os.environ["HF_HOME"] = hf_home
    else:
        hf_home = os.path.expanduser("~/.cache/huggingface")
        os.environ["HF_HOME"] = hf_home

    # 3. Create directory
    os.makedirs(hf_home, exist_ok=True)

    # 4. Check available space on the volume holding the cache
    total, used, free = shutil.disk_usage("/data" if hf_home.startswith("/data") else hf_home)
    free_gb = free / (1024**3)

    # 5. Validate sufficient space (an 8B model in bfloat16 is ~16GB)
    if free_gb < 16.0:
        logger.error(f"❌ Insufficient space: {free_gb:.2f} GB free, need 16+ GB")
        # Fall back to the ephemeral cache if persistent storage is full
        if hf_home.startswith("/data"):
            hf_home = os.path.expanduser("~/.cache/huggingface")
            os.environ["HF_HOME"] = hf_home
            os.makedirs(hf_home, exist_ok=True)
            logger.warning("⚠️ Falling back to ephemeral cache")

    return hf_home
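
Usage is order-sensitive: call setup_storage() before importing any HF library, so the environment variable is picked up:

hf_home = setup_storage()
print(f"Models will be cached under {hf_home}/hub/")

# Only now import HF libraries, so they see the configured HF_HOME
from transformers import pipeline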

Model Loading (No Changes Needed!)

# This is already optimal - HuggingFace handles caching automatically
pipe = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=hf_token_lc,
    # cache_dir is inherited from HF_HOME automatically
    # trust_remote_code=True  # if needed
)

Alternative Approaches (NOT Recommended for Our Use Case)

❌ Approach 1: Manual cache_dir Parameter

# DON'T DO THIS - it overrides HF_HOME and is less flexible
model = AutoModel.from_pretrained(
    model_name,
    cache_dir="/data/.huggingface"  # Hardcoded, less flexible
)

Why not: Setting HF_HOME is more flexible and works across all HF libraries.

❌ Approach 2: local_dir Parameter

# DON'T DO THIS - bypasses the cache system
snapshot_download(
    repo_id=model_name,
    local_dir="/data/models",  # Creates duplicate, no deduplication
    local_dir_use_symlinks=False
)

Why not: You lose the benefits of deduplication and revision management.

❌ Approach 3: Pre-downloading in Dockerfile

# DON'T DO THIS - doesn't work with dynamic persistent storage
RUN python -c "from transformers import pipeline; pipeline('text-generation', model='...')"

Why not: files baked into the image sit in read-only layers and bloat the image; they don't live in the Space's persistent storage, so the download still has to happen at runtime into /data.
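
If warm starts matter, a common alternative (a sketch, not part of this repo) is to pre-warm the cache at application startup, once HF_HOME already points at /data:

import os
from huggingface_hub import snapshot_download

# Downloads into the normal HF cache (respects HF_HOME), so symlink
# deduplication and revision management are preserved
snapshot_download(
    "LinguaCustodia/llama3.1-8b-fin-v0.3",
    token=os.getenv("HF_TOKEN_LC"),
)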

Cache Management Commands

Scan Cache (Useful for Debugging)

# See what's cached
huggingface-cli scan-cache

# Detailed view with all revisions
huggingface-cli scan-cache -v

# See cache location
python -c "from huggingface_hub import scan_cache_dir; print(scan_cache_dir())"

Clean Cache (When Needed)

# Review cached repos and delete models or old revisions interactively
huggingface-cli delete-cache

# Clear entire cache (nuclear option)
rm -rf ~/.cache/huggingface/hub/
# or
rm -rf /data/.huggingface/hub/

Programmatic Cleanup

from huggingface_hub import scan_cache_dir

# Scan cache
cache_info = scan_cache_dir()

# Find large repos
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk_str}")

# Delete a specific revision (pass the commit hash itself, not repo_id@hash)
strategy = cache_info.delete_revisions("abc123...")
print(f"Will free {strategy.expected_freed_size_str}")
strategy.execute()

Best Practices Summary

✅ DO

  1. Use HF_HOME environment variable for persistent storage
  2. Let HuggingFace handle caching - don't override with cache_dir
  3. Monitor disk space before loading models
  4. Clean up failed downloads (.incomplete, .lock files)
  5. Use symlinks (enabled by default on Linux)
  6. Set fallback to ephemeral cache if persistent storage is full
  7. One HF_HOME per environment (avoid conflicts)

❌ DON'T

  1. Don't use force_download=True (bypasses cache)
  2. Don't use local_dir for models (breaks deduplication)
  3. Don't hardcode cache_dir in model loading
  4. Don't manually copy model files (breaks symlinks)
  5. Don't assume cache is broken - check disk space first!
  6. Don't delete cache blindly - use huggingface-cli scan-cache first

For LinguaCustodia Models

Authentication

# Use the correct token
import os
from huggingface_hub import login

login(token=os.getenv('HF_TOKEN_LC'))  # For private LinguaCustodia models

# Or pass token directly to pipeline
pipe = pipeline(
    "text-generation",
    model="LinguaCustodia/llama3.1-8b-fin-v0.3",
    token=os.getenv('HF_TOKEN_LC')
)

Expected Cache Size

  • llama3.1-8b-fin-v0.3: ~16GB (8B params × 2 bytes in bfloat16)
  • llama3.1-8b-fin-v0.4: ~16GB (8B params × 2 bytes in bfloat16)
  • Total for both: ~32GB (deduplication is per-repository, so two separate repos do not share blobs)

Storage Requirements

  • Minimum: 20GB persistent storage (one 8B bfloat16 model, with little headroom)
  • Recommended: 40GB (multiple revisions + wiggle room)
  • Optimal: 50GB+ (multiple models + safety margin)
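
The sizes above are simple arithmetic (parameters × bytes per parameter); a back-of-envelope helper, ours for illustration:

def approx_checkpoint_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough checkpoint size: bfloat16/float16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1024**3

print(f"{approx_checkpoint_gb(8e9):.1f} GB")  # ~14.9 GB for an 8B bf16 model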

Conclusion

What We Were Doing Wrong

❌ Nothing fundamentally wrong with our cache configuration!

The issue was:

  1. Disk space exhaustion (0.07 MB free out of 20GB)
  2. Failed downloads leaving partial files
  3. No cleanup mechanism for incomplete downloads

What We Need to Fix

  1. ✅ Add disk space checks before downloads
  2. ✅ Implement cleanup for .incomplete and .lock files
  3. ✅ Add fallback to ephemeral cache when persistent is full
  4. ✅ Monitor cache size with huggingface-cli scan-cache

Our Current Setup is Optimal

✅ Setting HF_HOME=/data/.huggingface is correct
✅ Using pipeline() and from_pretrained() is correct
✅ The cache system is working - we just ran out of disk space

Once we clear the persistent storage, the model will:

  • Download once to /data/.huggingface/hub/
  • Stay cached across Space restarts
  • Not be re-downloaded unless the model is updated
  • Share common files between revisions efficiently
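
After redeploying, a quick check that the model actually landed in the persistent cache (scan_cache_dir respects HF_HOME):

from huggingface_hub import scan_cache_dir

# On the Space this scans /data/.huggingface/hub/
repos = {r.repo_id: r.size_on_disk_str for r in scan_cache_dir().repos}
print(repos.get("LinguaCustodia/llama3.1-8b-fin-v0.3", "not cached yet"))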

Action Required: Clear persistent storage to free up the 20GB, then redeploy.