---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---
# 🔬 Attribution Graph Probing
**Automated Attribution Graph Analysis through Probe Prompting**
An interactive research tool for automated analysis and interpretation of attribution graphs built on Sparse Autoencoder (SAE) and Cross-Layer Transcoder (CLT) features.
---
## 🚀 Quick Start
This Space implements a **3-stage pipeline** for analyzing neural network features:
1. **🌐 Graph Generation**: Generate attribution graphs on Neuronpedia
2. **πŸ” Probe Prompts**: Analyze feature activations on semantic concepts
3. **πŸ”— Node Grouping**: Automatically classify and name features
### Try the Demo
Click through the sidebar pages to explore the Dallas example dataset included in this Space.
---
## 🔑 API Keys Required
To use this Space with your own data, you need:
1. **Neuronpedia API Key** - Get it from [neuronpedia.org](https://www.neuronpedia.org)
2. **OpenAI API Key** - For concept generation (optional)
Add these as **Secrets** in Space Settings:
- `NEURONPEDIA_API_KEY=your-key-here`
- `OPENAI_API_KEY=your-key-here`
Or enter them directly in the sidebar when using the app.
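
Spaces exposes secrets to the running app as environment variables, so a key lookup might look like the sketch below. The helper name `resolve_api_key` and the idea that a sidebar value takes precedence are illustrative assumptions, not the app's actual code.

```python
import os
from typing import Optional


def resolve_api_key(name: str, sidebar_value: Optional[str] = None) -> Optional[str]:
    """Return a key typed into the sidebar if present, else the environment variable.

    Hypothetical helper: the app's real lookup order may differ.
    """
    if sidebar_value and sidebar_value.strip():
        return sidebar_value.strip()
    value = os.environ.get(name, "").strip()
    return value or None  # None signals "no key configured"


neuronpedia_key = resolve_api_key("NEURONPEDIA_API_KEY")
openai_key = resolve_api_key("OPENAI_API_KEY")  # optional: only used for concept generation
```

Returning `None` rather than an empty string makes it easy for calling code to skip the optional OpenAI step when no key is set.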
---
## 📊 Features
### Stage 1: Graph Generation
- Generate attribution graphs via Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
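
As a rough illustration of how cumulative influence can guide feature selection, the sketch below ranks nodes by influence and keeps the smallest top-ranked set that covers a chosen share of the total. The function name, the input layout (a mapping from node id to influence score), and the coverage threshold are assumptions for illustration; the app's own metric definitions may differ.

```python
from typing import Dict, List


def select_by_cumulative_influence(influence: Dict[str, float], coverage: float = 0.8) -> List[str]:
    """Keep top-ranked nodes until their summed influence reaches `coverage` of the total."""
    total = sum(influence.values())
    if total <= 0:
        return []
    selected: List[str] = []
    running = 0.0
    # Rank nodes from most to least influential.
    for node_id, score in sorted(influence.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(node_id)
        running += score
        if running / total >= coverage:
            break
    return selected


# Toy scores, not taken from the Dallas example.
toy_scores = {"f1": 4.0, "f2": 2.0, "f3": 1.0, "f4": 1.0}
print(select_by_cumulative_influence(toy_scores, coverage=0.75))
```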
### Stage 2: Probe Prompts
- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
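
Checkpointing and resuming can be sketched as follows. This is a generic pattern, not the app's implementation: the JSON file layout, the `run_with_checkpoints` name, and the `measure` callback are all hypothetical.

```python
import json
import os
from typing import Callable, Dict, Iterable


def run_with_checkpoints(
    concepts: Iterable[str],
    measure: Callable[[str], float],
    checkpoint_path: str,
) -> Dict[str, float]:
    """Process each concept once, saving progress after every step so a rerun can resume."""
    results: Dict[str, float] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as fh:
            results = json.load(fh)  # resume from the last saved state

    for concept in concepts:
        if concept in results:
            continue  # already handled in an earlier run
        results[concept] = measure(concept)
        # Write to a temp file first, then swap it in, so an interruption
        # mid-write does not leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
        os.replace(tmp_path, checkpoint_path)
    return results
```

Calling the function again with the same `checkpoint_path` skips every concept already stored in the file and only measures the remaining ones.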
### Stage 3: Node Grouping
- Classify features into 4 categories:
  - **Semantic (Dictionary)**: Specific tokens
  - **Semantic (Concept)**: Related concepts
  - **Say "X"**: Output predictions
  - **Relationship**: Entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization
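
The four categories above can be represented as a small enum, with features collected under the label each one receives. This is only a data-structure sketch: the `NodeCategory` and `group_by_category` names are hypothetical, and the classification rules themselves are not shown because this README does not specify them.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, List, Tuple


class NodeCategory(Enum):
    SEMANTIC_DICTIONARY = "Semantic (Dictionary)"  # specific tokens
    SEMANTIC_CONCEPT = "Semantic (Concept)"        # related concepts
    SAY_X = 'Say "X"'                              # output predictions
    RELATIONSHIP = "Relationship"                  # entity relationships


def group_by_category(labels: List[Tuple[str, NodeCategory]]) -> Dict[NodeCategory, List[str]]:
    """Collect feature ids under the category each was assigned."""
    groups: Dict[NodeCategory, List[str]] = defaultdict(list)
    for feature_id, category in labels:
        groups[category].append(feature_id)
    return dict(groups)


# Made-up feature ids and labels, for illustration only.
example = [
    ("feat_a", NodeCategory.SAY_X),
    ("feat_b", NodeCategory.RELATIONSHIP),
    ("feat_c", NodeCategory.SAY_X),
]
for category, ids in group_by_category(example).items():
    print(f"{category.value}: {ids}")
```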
---
## 📁 Example Dataset
This Space includes the **Dallas example**:
- **Prompt**: "The capital of state containing Dallas is"
- **Target**: "Austin"
- **Features**: 55 features from Gemma-2-2B model
- **Complete pipeline outputs**: Graph, activations, classifications
Navigate to each stage page to explore the example data.
---
## 📖 Documentation
- **Complete Guide**: See `eda/README.md` in the Files tab
- **Quick Start**: `QUICK_START_STREAMLIT.md`
- **Main README**: `readme.md`
---
## 🔬 Research Context
This tool is part of research on **automated sparse feature interpretation** using probe prompting techniques.
**Related Work:**
- [Circuit Tracer](https://github.com/safety-research/circuit-tracer) by Anthropic
- [Attribution Graphs](https://transformer-circuits.pub/2025/attribution-graphs/)
- [Neuronpedia](https://www.neuronpedia.org)
---
## 🛠️ Technical Details
**Models Supported:**
- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia
**Resource Usage:**
- RAM: ~2-3GB for typical analyses
- CPU: Efficient for API-based processing
- Storage: Outputs saved during session
---
## 📝 How to Use
### With Example Data (No API Keys Needed)
1. Navigate through the 3 stage pages in the sidebar
2. Load the Dallas example files provided
3. Explore visualizations and results
### With Your Own Data (API Keys Required)
1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1**: Generate a new graph with your prompt
3. **Stage 2**: Generate concepts and analyze activations
4. **Stage 3**: Classify and name features automatically
---
## 🀝 Contributing
This is a research project for mechanistic interpretability. Feedback and contributions welcome!
---
## 📄 License
GPL-3.0 - See LICENSE file for details
---
**Version**: 2.0.0-clean
**Last Updated**: November 2025
**Deployed on**: Hugging Face Spaces