
HuggingFace Model Caching - Best Practices & Analysis

Current Situation Analysis

What We've Been Doing

We've been setting HF_HOME=/data/.huggingface to store models in persistent storage. This is the correct approach, but we ran into disk space issues.

The Problem

The persistent storage (20GB) filled up completely (0.07 MB free) due to:

  1. Failed download attempts leaving partial files
  2. No automatic cleanup of incomplete downloads
  3. Multiple revisions being cached unnecessarily
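
For reference, a quick way to reproduce this diagnosis from Python (assuming persistent storage is mounted at /data, as on our Space):

import shutil

# How much space is left on the persistent volume?
total, used, free = shutil.disk_usage("/data")
print(f"free: {free / 1024**2:.2f} MB of {total / 1024**3:.0f} GB total")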

How HuggingFace Caching Actually Works

Cache Directory Structure

~/.cache/huggingface/hub/  (or $HF_HOME/hub/)
├── models--LinguaCustodia--llama3.1-8b-fin-v0.3/
│   ├── refs/
│   │   └── main           # Points to current commit hash
│   ├── blobs/             # Actual model files (named by hash)
│   │   ├── 403450e234...  # Model weights
│   │   ├── 7cb18dc9ba...  # Config file
│   │   └── d7edf6bd2a...  # Tokenizer file
│   └── snapshots/         # Symlinks to blobs for each revision
│       ├── aaaaaa.../     # First revision
│       │   ├── config.json -> ../../blobs/7cb18...
│       │   └── pytorch_model.bin -> ../../blobs/403450...
│       └── bbbbbb.../     # Second revision (shares unchanged files)
│           ├── config.json -> ../../blobs/7cb18... (same blob!)
│           └── pytorch_model.bin -> ../../blobs/NEW_HASH...

Key Insights

  1. Symlink-Based Deduplication

    • HuggingFace uses symlinks to avoid storing duplicate files
    • If a file doesn't change between revisions, it's only stored once
    • The blobs/ directory contains actual data
    • The snapshots/ directory contains symlinks organized by revision
  2. Cache is Smart

    • Models are downloaded ONCE and reused
    • Each file is identified by its hash
    • Multiple revisions share common files
    • No re-download unless files actually change
  3. Why We're Not Seeing Re-downloads

    • We ARE using the cache correctly!
    • Setting HF_HOME=/data/.huggingface is the right approach
    • The issue was disk space, not cache configuration
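
To see the deduplication for yourself, here is a small sketch (ours, assuming the layout above and top-level snapshot files only) that resolves each snapshot symlink to its backing blob:

import os

# Identical files across revisions resolve to the same blob: that is the deduplication
repo_dir = os.path.expanduser(
    "~/.cache/huggingface/hub/models--LinguaCustodia--llama3.1-8b-fin-v0.3"
)
snapshots = os.path.join(repo_dir, "snapshots")
for revision in sorted(os.listdir(snapshots)):
    rev_dir = os.path.join(snapshots, revision)
    for name in sorted(os.listdir(rev_dir)):
        path = os.path.join(rev_dir, name)
        if os.path.islink(path):
            blob = os.path.basename(os.path.realpath(path))
            print(f"{revision[:8]}/{name} -> blobs/{blob[:10]}...")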

What We Should Be Doing

✅ Correct Practices (What We're Already Doing)

  1. Setting HF_HOME

    os.environ["HF_HOME"] = "/data/.huggingface"
    

    This is the official way to configure persistent caching.
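
    One caveat worth noting: huggingface_hub resolves its cache paths when it is first imported, so HF_HOME must be set before importing transformers or huggingface_hub. A minimal ordering sketch:

    import os

    # Set HF_HOME *before* any HF import: cache paths are resolved at import time
    os.environ["HF_HOME"] = "/data/.huggingface"

    from transformers import pipeline  # imported after the env var is set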

  2. Using from_pretrained() and pipeline()

    from transformers import pipeline
    import torch

    pipe = pipeline(
        "text-generation",
        model=model_name,          # e.g. "LinguaCustodia/llama3.1-8b-fin-v0.3"
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        token=hf_token_lc
    )
    

    These methods automatically use the cache - no additional configuration needed!

  3. No force_download: we're correctly NOT using force_download=True, which would bypass the cache.

🔧 What We Need to Fix

  1. Disk Space Management

    • Monitor available space before downloads
    • Clean up failed/incomplete downloads
    • Set proper fallback to ephemeral cache
  2. Handle Incomplete Downloads

    • HuggingFace may leave .incomplete and .lock files behind
    • These should be cleaned up periodically (a cleanup sketch follows this list)
  3. Monitor Cache Size

    • Use huggingface-cli scan-cache to understand disk usage
    • Remove old revisions if needed
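
A minimal cleanup sketch for item 2, assuming the standard hub/ layout under HF_HOME and that no download is running at the time (the helper name is ours):

import logging
import os

logger = logging.getLogger(__name__)

def clean_stale_downloads(hf_home: str) -> int:
    """Remove leftover *.incomplete and *.lock files from the HF cache."""
    removed = 0
    hub_dir = os.path.join(hf_home, "hub")
    for root, _dirs, files in os.walk(hub_dir):
        for name in files:
            if name.endswith((".incomplete", ".lock")):
                path = os.path.join(root, name)
                try:
                    os.remove(path)
                    removed += 1
                except OSError as exc:
                    logger.warning("Could not remove %s: %s", path, exc)
    return removed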

Optimal Configuration for HuggingFace Spaces

For Persistent Storage (20GB+)

def setup_storage():
    """Optimal setup for HuggingFace Spaces with persistent storage."""
    import logging
    import os
    import shutil

    logger = logging.getLogger(__name__)

    # 1. Check if HF_HOME is set by Space variables (highest priority)
    if "HF_HOME" in os.environ:
        hf_home = os.environ["HF_HOME"]
        logger.info(f"✅ Using HF_HOME from Space: {hf_home}")
    elif os.path.exists("/data"):
        # 2. Auto-detect persistent storage
        hf_home = "/data/.huggingface"
        os.environ["HF_HOME"] = hf_home
    else:
        hf_home = os.path.expanduser("~/.cache/huggingface")
        os.environ["HF_HOME"] = hf_home

    # 3. Create directory
    os.makedirs(hf_home, exist_ok=True)

    # 4. Check available space on the volume holding the cache
    total, used, free = shutil.disk_usage("/data" if hf_home.startswith("/data") else hf_home)
    free_gb = free / (1024**3)

    # 5. Validate sufficient space (an 8B model in bfloat16 is ~16GB)
    if free_gb < 16.0:
        logger.error(f"❌ Insufficient space: {free_gb:.2f} GB free, need 16+ GB")
        # Fall back to the ephemeral cache if persistent storage is full
        if hf_home.startswith("/data"):
            hf_home = os.path.expanduser("~/.cache/huggingface")
            os.environ["HF_HOME"] = hf_home
            os.makedirs(hf_home, exist_ok=True)
            logger.warning("⚠️ Falling back to ephemeral cache")

    return hf_home
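
Usage is order-sensitive: call setup_storage() before importing any HF library, so the environment variable is picked up:

hf_home = setup_storage()
print(f"Models will be cached under {hf_home}/hub/")

# Only now import HF libraries, so they see the configured HF_HOME
from transformers import pipeline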

Model Loading (No Changes Needed!)

# This is already optimal - HuggingFace handles caching automatically
pipe = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=hf_token_lc,
    # cache_dir is inherited from HF_HOME automatically
    # trust_remote_code=True  # if needed
)

Alternative Approaches (NOT Recommended for Our Use Case)

❌ Approach 1: Manual cache_dir Parameter

# DON'T DO THIS - it overrides HF_HOME and is less flexible
model = AutoModel.from_pretrained(
    model_name,
    cache_dir="/data/.huggingface"  # Hardcoded, less flexible
)

Why not: Setting HF_HOME is more flexible and works across all HF libraries.

❌ Approach 2: local_dir Parameter

# DON'T DO THIS - bypasses the cache system
snapshot_download(
    repo_id=model_name,
    local_dir="/data/models",  # Creates duplicate, no deduplication
    local_dir_use_symlinks=False
)

Why not: You lose the benefits of deduplication and revision management.

❌ Approach 3: Pre-downloading in Dockerfile

# DON'T DO THIS - doesn't work with dynamic persistent storage
RUN python -c "from transformers import pipeline; pipeline('text-generation', model='...')"

Why not: files baked into the image sit in read-only layers and bloat the image; they don't live in the Space's persistent storage, so the download still has to happen at runtime into /data.
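
If warm starts matter, a common alternative (a sketch, not part of this repo) is to pre-warm the cache at application startup, once HF_HOME already points at /data:

import os
from huggingface_hub import snapshot_download

# Downloads into the normal HF cache (respects HF_HOME), so symlink
# deduplication and revision management are preserved
snapshot_download(
    "LinguaCustodia/llama3.1-8b-fin-v0.3",
    token=os.getenv("HF_TOKEN_LC"),
)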

Cache Management Commands

Scan Cache (Useful for Debugging)

# See what's cached
huggingface-cli scan-cache

# Detailed view with all revisions
huggingface-cli scan-cache -v

# See cache location
python -c "from huggingface_hub import scan_cache_dir; print(scan_cache_dir())"

Clean Cache (When Needed)

# Review cached repos and delete models or old revisions interactively
huggingface-cli delete-cache

# Clear entire cache (nuclear option)
rm -rf ~/.cache/huggingface/hub/
# or
rm -rf /data/.huggingface/hub/

Programmatic Cleanup

from huggingface_hub import scan_cache_dir

# Scan cache
cache_info = scan_cache_dir()

# Find large repos
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk_str}")

# Delete a specific revision (pass the commit hash itself, not repo_id@hash)
strategy = cache_info.delete_revisions("abc123...")
print(f"Will free {strategy.expected_freed_size_str}")
strategy.execute()

Best Practices Summary

✅ DO

  1. Use HF_HOME environment variable for persistent storage
  2. Let HuggingFace handle caching - don't override with cache_dir
  3. Monitor disk space before loading models
  4. Clean up failed downloads (.incomplete, .lock files)
  5. Use symlinks (enabled by default on Linux)
  6. Set fallback to ephemeral cache if persistent storage is full
  7. One HF_HOME per environment (avoid conflicts)

❌ DON'T

  1. Don't use force_download=True (bypasses cache)
  2. Don't use local_dir for models (breaks deduplication)
  3. Don't hardcode cache_dir in model loading
  4. Don't manually copy model files (breaks symlinks)
  5. Don't assume cache is broken - check disk space first!
  6. Don't delete cache blindly - use huggingface-cli scan-cache first

For LinguaCustodia Models

Authentication

# Use the correct token
import os
from huggingface_hub import login

login(token=os.getenv('HF_TOKEN_LC'))  # For private LinguaCustodia models

# Or pass token directly to pipeline
pipe = pipeline(
    "text-generation",
    model="LinguaCustodia/llama3.1-8b-fin-v0.3",
    token=os.getenv('HF_TOKEN_LC')
)

Expected Cache Size

  • llama3.1-8b-fin-v0.3: ~16GB (8B params × 2 bytes in bfloat16)
  • llama3.1-8b-fin-v0.4: ~16GB (8B params × 2 bytes in bfloat16)
  • Total for both: ~32GB (deduplication is per-repository, so two separate repos do not share blobs)

Storage Requirements

  • Minimum: 20GB persistent storage (one 8B bfloat16 model, with little headroom)
  • Recommended: 40GB (multiple revisions + wiggle room)
  • Optimal: 50GB+ (multiple models + safety margin)
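
The sizes above are simple arithmetic (parameters × bytes per parameter); a back-of-envelope helper, ours for illustration:

def approx_checkpoint_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough checkpoint size: bfloat16/float16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1024**3

print(f"{approx_checkpoint_gb(8e9):.1f} GB")  # ~14.9 GB for an 8B bf16 model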

Conclusion

What We Were Doing Wrong

❌ Nothing fundamentally wrong with our cache configuration!

The issue was:

  1. Disk space exhaustion (0.07 MB free out of 20GB)
  2. Failed downloads leaving partial files
  3. No cleanup mechanism for incomplete downloads

What We Need to Fix

  1. ✅ Add disk space checks before downloads
  2. ✅ Implement cleanup for .incomplete and .lock files
  3. ✅ Add fallback to ephemeral cache when persistent is full
  4. ✅ Monitor cache size with huggingface-cli scan-cache

Our Current Setup is Optimal

✅ Setting HF_HOME=/data/.huggingface is correct
✅ Using pipeline() and from_pretrained() is correct
✅ The cache system is working - we just ran out of disk space

Once we clear the persistent storage, the model will:

  • Download once to /data/.huggingface/hub/
  • Stay cached across Space restarts
  • Not be re-downloaded unless the model is updated
  • Share common files between revisions efficiently
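
After redeploying, a quick check that the model actually landed in the persistent cache (scan_cache_dir respects HF_HOME):

from huggingface_hub import scan_cache_dir

# On the Space this scans /data/.huggingface/hub/
repos = {r.repo_id: r.size_on_disk_str for r in scan_cache_dir().repos}
print(repos.get("LinguaCustodia/llama3.1-8b-fin-v0.3", "not cached yet"))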

Action Required: Clear persistent storage to free up the 20GB, then redeploy.