---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---
# 🔬 Attribution Graph Probing

**Automated Attribution Graph Analysis through Probe Prompting**

An interactive research tool for automated analysis and interpretation of attribution graphs from Sparse Autoencoders (SAEs) and Cross-Layer Transcoders (CLTs).
## Quick Start

This Space implements a three-stage pipeline for analyzing neural network features:

1. **Graph Generation** - generate attribution graphs on Neuronpedia
2. **Probe Prompts** - analyze feature activations on semantic concepts
3. **Node Grouping** - automatically classify and name features
### Try the Demo

Click through the sidebar pages to explore the Dallas example dataset included in this Space.
## API Keys Required

To use this Space with your own data, you need:

- **Neuronpedia API Key** - get it from [neuronpedia.org](https://neuronpedia.org)
- **OpenAI API Key** - for concept generation (optional)

Add these as Secrets in your Space Settings:
```
NEURONPEDIA_API_KEY=your-key-here
OPENAI_API_KEY=your-key-here
```
Or enter them directly in the sidebar when using the app.
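If you run the app locally instead (this Space uses the Docker SDK on port 7860), the same keys can be passed as environment variables. The image name below is only a placeholder, not something the repository defines:

```shell
# Build the Space image locally (image name is illustrative)
docker build -t attribution-graph-probing .

# Run on the same port the Space exposes, passing the API keys as env vars
docker run -p 7860:7860 \
  -e NEURONPEDIA_API_KEY="your-key-here" \
  -e OPENAI_API_KEY="your-key-here" \
  attribution-graph-probing
```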
## Features

### Stage 1: Graph Generation

- Generate attribution graphs via the Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
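To illustrate the static metrics above, here is a minimal sketch of how per-node influence scores could drive a cumulative-influence feature selection. The field names (`influence`, `id`) and the 80% threshold are assumptions for illustration, not the tool's actual implementation:

```python
def select_by_cumulative_influence(nodes, threshold=0.8):
    """Keep the most influential nodes until they jointly account for
    `threshold` of total influence (hypothetical helper)."""
    ranked = sorted(nodes, key=lambda n: n["influence"], reverse=True)
    total = sum(n["influence"] for n in ranked)
    selected, cumulative = [], 0.0
    for node in ranked:
        selected.append(node)
        cumulative += node["influence"] / total
        if cumulative >= threshold:
            break
    return selected

# Toy graph nodes with made-up ids and influence scores
nodes = [
    {"id": "L12/f_301", "influence": 0.50},
    {"id": "L8/f_17",   "influence": 0.30},
    {"id": "L3/f_942",  "influence": 0.15},
    {"id": "L1/f_5",    "influence": 0.05},
]
top = select_by_cumulative_influence(nodes)  # first two nodes cover 80%
```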
### Stage 2: Probe Prompts

- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
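The checkpoint/resume behavior can be sketched roughly as below. This is an illustrative pattern, not the tool's actual code; `measure_activations` is a hypothetical stand-in for the per-concept activation measurement:

```python
import json
from pathlib import Path

def run_with_checkpoints(concepts, measure_activations,
                         checkpoint_path="probe_checkpoint.json"):
    """Process concepts one by one, saving results after each so an
    interrupted run can resume where it left off (illustrative sketch)."""
    path = Path(checkpoint_path)
    results = json.loads(path.read_text()) if path.exists() else {}
    for concept in concepts:
        if concept in results:      # already done in a previous run
            continue
        results[concept] = measure_activations(concept)
        path.write_text(json.dumps(results))  # checkpoint after each concept
    return results
```

On a re-run with the same checkpoint file, completed concepts are skipped and only the remaining ones are processed.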
### Stage 3: Node Grouping

- Classify features into four categories:
  - **Semantic (Dictionary)**: specific tokens
  - **Semantic (Concept)**: related concepts
  - **Say "X"**: output predictions
  - **Relationship**: entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization
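As a toy sketch of what rule-based classification over activation patterns could look like, consider the heuristic below. The profile fields, thresholds, and rules are invented for illustration and do not reflect the tool's actual classifier:

```python
def classify_feature(profile):
    """Assign one of the four categories from a feature's activation
    profile (toy heuristic; fields and thresholds are assumptions)."""
    if profile["predicts_output_token"]:
        return 'Say "X"'
    if profile["relation_score"] > 0.5:
        return "Relationship"
    # Dictionary-like features fire on a handful of exact tokens;
    # concept-like features fire broadly across related prompts.
    if profile["n_activating_tokens"] <= 3:
        return "Semantic (Dictionary)"
    return "Semantic (Concept)"

example = {"predicts_output_token": False, "relation_score": 0.1,
           "n_activating_tokens": 2}
classify_feature(example)  # "Semantic (Dictionary)"
```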
## Example Dataset

This Space includes the Dallas example:

- **Prompt:** "The capital of state containing Dallas is"
- **Target:** "Austin"
- **Features:** 55 features from the Gemma-2-2B model
- **Complete pipeline outputs:** graph, activations, classifications

Navigate to each stage page to explore the example data.
## Documentation

- **Complete Guide:** see `eda/README.md` in the Files tab
- **Quick Start:** `QUICK_START_STREAMLIT.md`
- **Main README:** `readme.md`
## Research Context

This tool is part of research on automated sparse feature interpretation using probe prompting techniques.

Related work:

- Circuit Tracer by Anthropic
- Attribution Graphs
- Neuronpedia
## Technical Details

**Models supported:**

- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia

**Resource usage:**

- RAM: ~2-3 GB for typical analyses
- CPU: light, since processing is API-based
- Storage: outputs are saved for the duration of the session
## How to Use

### With Example Data (No API Keys Needed)

1. Navigate through the three stage pages in the sidebar
2. Load the provided Dallas example files
3. Explore the visualizations and results

### With Your Own Data (API Keys Required)

1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1:** generate a new graph with your prompt
3. **Stage 2:** generate concepts and analyze activations
4. **Stage 3:** classify and name features automatically
## Contributing

This is a research project in mechanistic interpretability. Feedback and contributions are welcome!
## License

GPL-3.0 - see the LICENSE file for details.

---

**Version:** 2.0.0-clean ·
**Last updated:** November 2025 ·
**Deployed on:** Hugging Face Spaces