HuggingFace Model Caching - Best Practices & Analysis
Current Situation Analysis
What We've Been Doing
We've been setting `HF_HOME=/data/.huggingface` to store models in persistent storage. This is correct, but we encountered disk space issues.
The Problem
The persistent storage (20GB) filled up completely (0.07 MB free) due to:
- Failed download attempts leaving partial files
- No automatic cleanup of incomplete downloads
- Multiple revisions being cached unnecessarily
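A quick diagnostic sketch to confirm this state from inside the Space (paths match our configuration below; this is illustrative, not part of the app):

```python
import shutil
from pathlib import Path

# Free space on the persistent volume
total, used, free = shutil.disk_usage("/data")
print(f"/data: {free / 1024**2:.2f} MB free of {total / 1024**3:.0f} GB")

# Leftover partial downloads from failed attempts
hub = Path("/data/.huggingface/hub")
partials = list(hub.glob("**/*.incomplete")) if hub.exists() else []
print(f"Leftover partial downloads: {len(partials)}")
```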
How HuggingFace Caching Actually Works
Cache Directory Structure
```
~/.cache/huggingface/hub/          (or $HF_HOME/hub/)
└── models--LinguaCustodia--llama3.1-8b-fin-v0.3/
    ├── refs/
    │   └── main                   # Points to current commit hash
    ├── blobs/                     # Actual model files (named by hash)
    │   ├── 403450e234...          # Model weights
    │   ├── 7cb18dc9ba...          # Config file
    │   └── d7edf6bd2a...          # Tokenizer file
    └── snapshots/                 # Symlinks to blobs for each revision
        ├── aaaaaa.../             # First revision
        │   ├── config.json -> ../../blobs/7cb18...
        │   └── pytorch_model.bin -> ../../blobs/403450...
        └── bbbbbb.../             # Second revision (shares unchanged files)
            ├── config.json -> ../../blobs/7cb18... (same blob!)
            └── pytorch_model.bin -> ../../blobs/NEW_HASH...
```
Key Insights
Symlink-Based Deduplication
- HuggingFace uses symlinks to avoid storing duplicate files
- If a file doesn't change between revisions, it's only stored once
- The `blobs/` directory contains the actual data
- The `snapshots/` directory contains symlinks organized by revision
Cache is Smart
- Models are downloaded ONCE and reused
- Each file is identified by its hash
- Multiple revisions share common files
- No re-download unless files actually change (a quick check is sketched below)
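A quick way to verify a cache hit without touching the network is `huggingface_hub.try_to_load_from_cache` (the repo and filename below are the examples used in this document):

```python
from huggingface_hub import try_to_load_from_cache

# Returns a filesystem path if the file is already cached;
# this check never triggers a download.
cached = try_to_load_from_cache(
    repo_id="LinguaCustodia/llama3.1-8b-fin-v0.3",
    filename="config.json",
)
if isinstance(cached, str):
    print(f"Cache hit: {cached}")
else:
    print("Not cached yet - the next from_pretrained() will download it")
```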
Why We're Not Seeing Re-downloads
- We ARE using the cache correctly!
- Setting `HF_HOME=/data/.huggingface` is the right approach
- The issue was disk space, not cache configuration
What We Should Be Doing
✅ Correct Practices (What We're Already Doing)
Setting HF_HOME
```python
os.environ["HF_HOME"] = "/data/.huggingface"
```
This is the official way to configure persistent caching.
Using `from_pretrained()` and `pipeline()`
```python
pipe = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=hf_token_lc,
)
```
These methods automatically use the cache - no additional configuration needed!
No `force_download`
We're correctly NOT using `force_download=True`, which would bypass the cache.
🔧 What We Need to Fix
Disk Space Management
- Monitor available space before downloads
- Clean up failed/incomplete downloads
- Set proper fallback to ephemeral cache
Handle Incomplete Downloads
- HuggingFace may leave `.incomplete` and `.lock` files behind
- These should be cleaned up periodically (see the sketch after this list)
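A minimal cleanup sketch, assuming the cache lives under `HF_HOME` and no download is currently in progress (`cleanup_incomplete_downloads` is our own helper name, not a HuggingFace API):

```python
from pathlib import Path

def cleanup_incomplete_downloads(hf_home: str = "/data/.huggingface") -> int:
    """Remove leftover .incomplete and .lock files under the hub cache.

    Only run this when no download is in progress: .lock files
    coordinate concurrent downloads.
    """
    removed = 0
    hub_dir = Path(hf_home) / "hub"
    if not hub_dir.exists():
        return 0
    for pattern in ("**/*.incomplete", "**/*.lock"):
        for leftover in hub_dir.glob(pattern):
            leftover.unlink(missing_ok=True)
            removed += 1
    return removed
```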
Monitor Cache Size
- Use `hf cache scan` to understand disk usage
- Remove old revisions if needed (see Cache Management Commands below)
Optimal Configuration for HuggingFace Spaces
For Persistent Storage (20GB+)
```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)

def setup_storage():
    """Optimal setup for HuggingFace Spaces with persistent storage."""
    # 1. Check if HF_HOME is set by Space variables (highest priority)
    if "HF_HOME" in os.environ:
        hf_home = os.environ["HF_HOME"]
        logger.info(f"✅ Using HF_HOME from Space: {hf_home}")
    else:
        # 2. Auto-detect persistent storage
        if os.path.exists("/data"):
            hf_home = "/data/.huggingface"
        else:
            hf_home = os.path.expanduser("~/.cache/huggingface")
        os.environ["HF_HOME"] = hf_home

    # 3. Create the cache directory
    os.makedirs(hf_home, exist_ok=True)

    # 4. Check available space on the volume holding the cache
    total, used, free = shutil.disk_usage(hf_home)
    free_gb = free / (1024**3)

    # 5. Validate sufficient space (need 10+ GB for an 8B model)
    if free_gb < 10.0:
        logger.error(f"❌ Insufficient space: {free_gb:.2f} GB free, need 10+ GB")
        # Fallback to the ephemeral cache if persistent storage is full
        if hf_home.startswith("/data"):
            hf_home = os.path.expanduser("~/.cache/huggingface")
            os.environ["HF_HOME"] = hf_home
            os.makedirs(hf_home, exist_ok=True)
            logger.warning("⚠️ Falling back to ephemeral cache")

    return hf_home
```
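A possible call site at application startup (hypothetical wiring; `setup_storage()` is the function above):

```python
# Run before loading any model so HF_HOME is already set.
hf_home = setup_storage()
print(f"HuggingFace cache: {hf_home}/hub")
```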
Model Loading (No Changes Needed!)
```python
# This is already optimal - HuggingFace handles caching automatically.
# model_name, tokenizer, and hf_token_lc come from the surrounding app code.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=hf_token_lc,
    # cache_dir is inherited from HF_HOME automatically
    # trust_remote_code=True  # if needed
)
```
Alternative Approaches (NOT Recommended for Our Use Case)
❌ Approach 1: Manual cache_dir Parameter
```python
# DON'T DO THIS - it overrides HF_HOME and is less flexible
model = AutoModel.from_pretrained(
    model_name,
    cache_dir="/data/.huggingface",  # Hardcoded, less flexible
)
```
Why not: Setting `HF_HOME` is more flexible and works across all HF libraries.
❌ Approach 2: local_dir Parameter
```python
# DON'T DO THIS - bypasses the cache system
snapshot_download(
    repo_id=model_name,
    local_dir="/data/models",  # Creates a duplicate, no deduplication
    local_dir_use_symlinks=False,
)
```
Why not: You lose the benefits of deduplication and revision management.
❌ Approach 3: Pre-downloading in Dockerfile
```dockerfile
# DON'T DO THIS - doesn't work with dynamic persistent storage
RUN python -c "from transformers import pipeline; pipeline('text-generation', model='...')"
```
Why not: Anything downloaded at build time is baked into the read-only image layer, not the persistent /data volume that is mounted at runtime.
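If warm-up is still desired, a hedged alternative is to pre-download at container startup, once `setup_storage()` (defined above) has pointed `HF_HOME` at persistent storage. `snapshot_download` without `local_dir` fills the normal cache, so deduplication and revision management are preserved:

```python
import os
from huggingface_hub import snapshot_download

hf_home = setup_storage()  # from the section above; sets HF_HOME
snapshot_download(
    repo_id="LinguaCustodia/llama3.1-8b-fin-v0.3",
    token=os.getenv("HF_TOKEN_LC"),
)
# Files land in $HF_HOME/hub/ with full symlink deduplication
```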
Cache Management Commands
Scan Cache (Useful for Debugging)
```bash
# See what's cached
hf cache scan

# Detailed view with all revisions
hf cache scan -v

# See cache location
python -c "from huggingface_hub import scan_cache_dir; print(scan_cache_dir())"
```
Clean Cache (When Needed)
```bash
# Interactively select repos/revisions to delete
# (on older huggingface_hub versions: huggingface-cli delete-cache)
hf cache delete

# To delete old revisions selectively, use the programmatic API below

# Clear entire cache (nuclear option)
rm -rf ~/.cache/huggingface/hub/
# or
rm -rf /data/.huggingface/hub/
```
Programmatic Cleanup
```python
from huggingface_hub import scan_cache_dir

# Scan cache
cache_info = scan_cache_dir()

# Find large repos
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk_str}")

# Delete a specific revision of LinguaCustodia/llama3.1-8b-fin-v0.3:
# pass the commit hash reported by the scan (not "repo@revision")
strategy = cache_info.delete_revisions("abc123...")
strategy.execute()
```
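For the "delete old revisions" case, a hedged sketch using only the documented scan/delete API, where we treat any cached revision no longer pointed to by a ref such as `main` as stale:

```python
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()

# Collect revisions that no ref (e.g. "main") points to anymore
stale = [
    rev.commit_hash
    for repo in cache_info.repos
    for rev in repo.revisions
    if not rev.refs
]

if stale:
    strategy = cache_info.delete_revisions(*stale)
    print(f"Will free {strategy.expected_freed_size_str}")
    strategy.execute()
```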
Best Practices Summary
✅ DO
- Use the `HF_HOME` environment variable for persistent storage
- Let HuggingFace handle caching - don't override with `cache_dir`
- Monitor disk space before loading models
- Clean up failed downloads (`.incomplete`, `.lock` files)
- Use symlinks (enabled by default on Linux)
- Set a fallback to the ephemeral cache if persistent storage is full
- One `HF_HOME` per environment (avoid conflicts)
❌ DON'T
- Don't use `force_download=True` (bypasses the cache)
- Don't use `local_dir` for models (breaks deduplication)
- Don't hardcode `cache_dir` in model loading
- Don't manually copy model files (breaks symlinks)
- Don't assume the cache is broken - check disk space first!
- Don't delete the cache blindly - use `hf cache scan` first
For LinguaCustodia Models
Authentication
```python
import os
from huggingface_hub import login
from transformers import pipeline

# Use the correct token for private LinguaCustodia models
login(token=os.getenv('HF_TOKEN_LC'))

# Or pass the token directly to pipeline
pipe = pipeline(
    "text-generation",
    model="LinguaCustodia/llama3.1-8b-fin-v0.3",
    token=os.getenv('HF_TOKEN_LC'),
)
```
Expected Cache Size
- llama3.1-8b-fin-v0.3: ~5GB (with bfloat16)
- llama3.1-8b-fin-v0.4: ~5GB (with bfloat16)
- Total for both: ~10GB (note: blobs are deduplicated within a repo, not across different model repos, so the total is simply the sum)
Storage Requirements
- Minimum: 10GB persistent storage
- Recommended: 20GB (for multiple revisions + wiggle room)
- Optimal: 50GB (for multiple models + safety margin)
Conclusion
What We Were Doing Wrong
✅ Nothing fundamentally wrong with our cache configuration!
The issue was:
- Disk space exhaustion (0.07 MB free out of 20GB)
- Failed downloads leaving partial files
- No cleanup mechanism for incomplete downloads
What We Need to Fix
- ✅ Add disk space checks before downloads
- ✅ Implement cleanup for `.incomplete` and `.lock` files
- ✅ Add fallback to ephemeral cache when persistent storage is full
- ✅ Monitor cache size with `hf cache scan`
Our Current Setup is Optimal
✅ Setting `HF_HOME=/data/.huggingface` is correct
✅ Using `pipeline()` and `from_pretrained()` is correct
✅ The cache system is working - we just ran out of disk space
Once we clear the persistent storage, the model will:
- Download once to `/data/.huggingface/hub/`
- Stay cached across Space restarts
- Not be re-downloaded unless the model is updated
- Share common files between revisions efficiently
Action Required: Clear persistent storage to free up the 20GB, then redeploy.