---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---

# 🔬 Attribution Graph Probing

**Automated Attribution Graph Analysis through Probe Prompting**

Interactive research tool for automated analysis and interpretation of attribution graphs from Sparse Autoencoders (SAE) and Cross-Layer Transcoders (CLT).

---

## Quick Start

This Space implements a **3-stage pipeline** for analyzing neural network features:

1. **Graph Generation**: Generate attribution graphs on Neuronpedia
2. **Probe Prompts**: Analyze feature activations on semantic concepts
3. **Node Grouping**: Automatically classify and name features

### Try the Demo

Click through the sidebar pages to explore the Dallas example dataset included in this Space.

---

## API Keys Required

To use this Space with your own data, you need:

1. **Neuronpedia API Key**: Get one from [neuronpedia.org](https://www.neuronpedia.org)
2. **OpenAI API Key** (optional): Used for concept generation

Add these as **Secrets** in Space Settings:
- `NEURONPEDIA_API_KEY=your-key-here`
- `OPENAI_API_KEY=your-key-here`

Or enter them directly in the sidebar when using the app.
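
As a minimal sketch of how the two entry points could be reconciled, the snippet below prefers a key typed into the sidebar and falls back to the Space secret, which Hugging Face exposes to the running app as an environment variable. The helper name `resolve_api_key` and the sidebar-first precedence are assumptions for illustration, not the app's actual code.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    name: str,
    sidebar_value: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the sidebar value if set, else the Space secret, else None.

    Hypothetical helper: the precedence order is an assumption.
    """
    for candidate in (sidebar_value, env.get(name)):
        # Treat empty or whitespace-only strings as "not provided".
        if candidate and candidate.strip():
            return candidate.strip()
    return None


neuronpedia_key = resolve_api_key("NEURONPEDIA_API_KEY")
openai_key = resolve_api_key("OPENAI_API_KEY")  # optional, may be None
```

Returning `None` instead of raising lets the optional OpenAI key stay unset.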

---

## Features

### Stage 1: Graph Generation
- Generate attribution graphs via Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
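
One way to read the two static metrics together is to rank nodes by influence and track the running share of total influence they cover. The sketch below assumes that definition of cumulative influence and a hypothetical mapping from node ID to a non-negative influence score; the values this pipeline exports may be computed differently.

```python
from typing import Dict, List, Tuple


def cumulative_influence(node_influence: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank nodes by influence (descending) and attach the running share of the total."""
    ranked = sorted(node_influence.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(score for _, score in ranked)
    if total <= 0:
        # Degenerate case: no influence mass to distribute.
        return [(node_id, 0.0) for node_id, _ in ranked]
    running = 0.0
    result = []
    for node_id, score in ranked:
        running += score
        result.append((node_id, running / total))
    return result


def select_top_nodes(node_influence: Dict[str, float], coverage: float = 0.8) -> List[str]:
    """Return the smallest top-ranked prefix whose cumulative share reaches `coverage`."""
    selected = []
    for node_id, share in cumulative_influence(node_influence):
        selected.append(node_id)
        if share >= coverage:
            break
    return selected
```

A coverage threshold like this keeps a small set of high-influence features for the later stages; the 0.8 default is an arbitrary placeholder.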

### Stage 2: Probe Prompts
- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
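
The checkpoint-and-resume behavior can be sketched as follows. This is an illustrative pattern, not the app's implementation: the function name, the JSON checkpoint format, and the `measure` callback (standing in for whatever call returns an activation for a concept) are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def measure_with_checkpoints(
    concepts: Iterable[str],
    measure: Callable[[str], float],
    checkpoint_path: Path,
) -> Dict[str, float]:
    """Measure each concept once, saving progress after every result."""
    results: Dict[str, float] = {}
    if checkpoint_path.exists():
        # Resume: reload whatever an earlier, interrupted run already finished.
        results = json.loads(checkpoint_path.read_text(encoding="utf-8"))

    for concept in concepts:
        if concept in results:
            continue  # already measured before the interruption
        results[concept] = measure(concept)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path.with_name(checkpoint_path.name + ".tmp")
        tmp_path.write_text(json.dumps(results), encoding="utf-8")
        tmp_path.replace(checkpoint_path)
    return results
```

Calling the function again with the same `checkpoint_path` after a failure skips the concepts that were already measured.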

### Stage 3: Node Grouping
- Classify features into 4 categories:
  - **Semantic (Dictionary)**: Specific tokens
  - **Semantic (Concept)**: Related concepts
  - **Say "X"**: Output predictions
  - **Relationship**: Entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization
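
For readers who want to post-process Stage 3 results, the four categories map naturally onto an enumeration. The sketch below only models the labels and groups already classified features; it does not reproduce the classification or naming rules, and the class and field names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class NodeCategory(Enum):
    """The four Stage 3 categories, keyed by their display labels."""
    SEMANTIC_DICTIONARY = "Semantic (Dictionary)"
    SEMANTIC_CONCEPT = "Semantic (Concept)"
    SAY_X = 'Say "X"'
    RELATIONSHIP = "Relationship"


@dataclass(frozen=True)
class NamedFeature:
    feature_id: str          # hypothetical identifier format
    category: NodeCategory   # label assigned by the classifier
    name: str                # name derived from activation patterns


def group_by_category(features: Iterable[NamedFeature]) -> Dict[NodeCategory, List[str]]:
    """Collect feature names under their category, preserving input order."""
    groups: Dict[NodeCategory, List[str]] = defaultdict(list)
    for feature in features:
        groups[feature.category].append(feature.name)
    return dict(groups)
```

Looking a label up with `NodeCategory("Relationship")` raises `ValueError` for unknown strings, which catches misspelled categories early.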

---

## Example Dataset

This Space includes the **Dallas example**:
- **Prompt**: "The capital of state containing Dallas is"
- **Target**: "Austin"
- **Features**: 55 features from the Gemma-2-2B model
- **Complete pipeline outputs**: Graph, activations, classifications

Navigate to each stage page to explore the example data.

---

## Documentation

- **Complete Guide**: See `eda/README.md` in the Files tab
- **Quick Start**: `QUICK_START_STREAMLIT.md`
- **Main README**: `readme.md`

---

## 🔬 Research Context

This tool is part of research on **automated sparse feature interpretation** using probe prompting techniques.

**Related Work:**
- [Circuit Tracer](https://github.com/safety-research/circuit-tracer) by Anthropic
- [Attribution Graphs](https://transformer-circuits.pub/2025/attribution-graphs/)
- [Neuronpedia](https://www.neuronpedia.org)

---

## 🛠️ Technical Details

**Models Supported:**
- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia

**Resource Usage:**
- RAM: ~2-3GB for typical analyses
- CPU: Efficient for API-based processing
- Storage: Outputs saved during session

---

## How to Use

### With Example Data (No API Keys Needed)
1. Navigate through the 3 stage pages in the sidebar
2. Load the Dallas example files provided
3. Explore visualizations and results

### With Your Own Data (API Keys Required)
1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1**: Generate a new graph with your prompt
3. **Stage 2**: Generate concepts and analyze activations
4. **Stage 3**: Classify and name features automatically

---

## 🤝 Contributing

This is a research project on mechanistic interpretability. Feedback and contributions are welcome!

---

## License

GPL-3.0. See the LICENSE file for details.

---

**Version**: 2.0.0-clean
**Last Updated**: November 2025
**Deployed on**: Hugging Face Spaces