---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---
# 🔬 Attribution Graph Probing
**Automated Attribution Graph Analysis through Probe Prompting**
Interactive research tool for automated analysis and interpretation of attribution graphs from Sparse Autoencoders (SAE) and Cross-Layer Transcoders (CLT).
---
## Quick Start
This Space implements a **3-stage pipeline** for analyzing neural network features:
1. **Graph Generation**: Generate attribution graphs on Neuronpedia
2. **Probe Prompts**: Analyze feature activations on semantic concepts
3. **Node Grouping**: Automatically classify and name features
### Try the Demo
Click through the sidebar pages to explore the Dallas example dataset included in this Space.
---
## API Keys Required
To use this Space with your own data, provide the following API keys:
1. **Neuronpedia API Key** (required) - Get it from [neuronpedia.org](https://www.neuronpedia.org)
2. **OpenAI API Key** (optional) - Used for concept generation
Add these as **Secrets** in Space Settings:
- `NEURONPEDIA_API_KEY=your-key-here`
- `OPENAI_API_KEY=your-key-here`
Or enter them directly in the sidebar when using the app.
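A minimal sketch of how the two key sources above could be combined, assuming Space secrets are read as environment variables and that a non-empty sidebar entry takes precedence. The helper names `load_api_keys` and `missing_required` are hypothetical and are not taken from this Space's code.

```python
import os
from typing import Dict, List, Mapping, Optional

KEY_NAMES = ("NEURONPEDIA_API_KEY", "OPENAI_API_KEY")


def load_api_keys(
    env: Mapping[str, str] = os.environ,
    sidebar_values: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Resolve each key from sidebar input first, then from the environment.

    Assumption: precedence of sidebar over secrets is illustrative only.
    """
    sidebar_values = sidebar_values or {}
    keys: Dict[str, Optional[str]] = {}
    for name in KEY_NAMES:
        # A blank or whitespace-only sidebar field falls back to the secret.
        typed = (sidebar_values.get(name) or "").strip()
        value = typed or (env.get(name) or "").strip()
        keys[name] = value or None
    return keys


def missing_required(keys: Mapping[str, Optional[str]]) -> List[str]:
    """Only the Neuronpedia key is required; the OpenAI key is optional."""
    return [name for name in ("NEURONPEDIA_API_KEY",) if not keys.get(name)]
```

Calling `missing_required(load_api_keys())` at startup would let the app warn about an absent Neuronpedia key while still allowing the example data to load.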
---
## Features
### Stage 1: Graph Generation
- Generate attribution graphs via Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
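As an illustration of the influence metrics and feature selection listed above, the sketch below ranks nodes by influence and keeps the smallest top-ranked set that covers a chosen share of the total. This README does not define how the tool computes influence, so the input format (a mapping from node ID to a non-negative score), the threshold value, and the function names are all assumptions.

```python
from typing import Dict, List, Tuple


def cumulative_influence(influence: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (node_id, cumulative share of total influence), highest influence first."""
    total = sum(influence.values())
    if total <= 0:
        raise ValueError("total influence must be positive")
    ranked = sorted(influence.items(), key=lambda item: item[1], reverse=True)
    running = 0.0
    shares = []
    for node_id, value in ranked:
        running += value
        shares.append((node_id, running / total))
    return shares


def select_features(influence: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Keep the top-ranked nodes until their cumulative share reaches `threshold`."""
    selected = []
    for node_id, share in cumulative_influence(influence):
        selected.append(node_id)
        if share >= threshold:
            break
    return selected
```

A cumulative cutoff like this keeps the selection size proportional to how concentrated the influence is, rather than fixing the number of features in advance.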
### Stage 2: Probe Prompts
- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
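The checkpoint-and-resume behavior can be sketched as follows. This is not the Space's implementation: the JSON checkpoint format, the `run_with_checkpoints` name, and the `measure` callback (standing in for whatever computes activations for one concept) are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable


def run_with_checkpoints(
    concepts: Iterable[str],
    measure: Callable[[str], Dict[str, float]],
    checkpoint_path: str,
) -> Dict[str, Dict[str, float]]:
    """Measure each concept once, saving progress after every concept."""
    results: Dict[str, Dict[str, float]] = {}
    if os.path.exists(checkpoint_path):
        # Resume: reuse everything a previous run already finished.
        with open(checkpoint_path, "r", encoding="utf-8") as handle:
            results = json.load(handle)

    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    for concept in concepts:
        if concept in results:
            continue  # already measured before the interruption
        results[concept] = measure(concept)
        # Write to a temporary file, then swap it in, so an interruption
        # during the write cannot leave a truncated checkpoint behind.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(results, handle)
        os.replace(tmp_path, checkpoint_path)
    return results
```

Rerunning the function with the same `checkpoint_path` after an interruption skips the concepts already stored and only calls `measure` for the remainder.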
### Stage 3: Node Grouping
- Classify features into 4 categories:
  - **Semantic (Dictionary)**: Specific tokens
  - **Semantic (Concept)**: Related concepts
  - **Say "X"**: Output predictions
  - **Relationship**: Entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization
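The four categories can be modeled as a small enumeration, with a toy naming rule layered on top. The category labels come from the list above; the naming rule (use the token with the highest peak activation, wrapped as `Say "..."` for output features) and the names `FeatureCategory`, `NamedFeature`, and `name_feature` are hypothetical, since this README does not spell out the tool's actual rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class FeatureCategory(Enum):
    SEMANTIC_DICTIONARY = "Semantic (Dictionary)"
    SEMANTIC_CONCEPT = "Semantic (Concept)"
    SAY_X = 'Say "X"'
    RELATIONSHIP = "Relationship"


@dataclass
class NamedFeature:
    feature_id: str
    category: FeatureCategory
    name: str


def name_feature(
    feature_id: str,
    category: FeatureCategory,
    peak_activations: Dict[str, float],
) -> NamedFeature:
    """Name a feature after the token on which it activates most strongly.

    Assumption: `peak_activations` maps a token to the feature's peak
    activation on that token across the probe prompts.
    """
    if not peak_activations:
        raise ValueError("at least one activation is needed to name a feature")
    top_token = max(peak_activations, key=peak_activations.get)
    if category is FeatureCategory.SAY_X:
        name = f'Say "{top_token}"'
    else:
        name = top_token
    return NamedFeature(feature_id, category, name)
```

For the Dallas example, a hypothetical output feature with peak activations `{"Austin": 9.1, "Texas": 2.3}` would be named `Say "Austin"` under this rule.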
---
## Example Dataset
This Space includes the **Dallas example**:
- **Prompt**: "The capital of state containing Dallas is"
- **Target**: "Austin"
- **Features**: 55 features from Gemma-2-2B model
- **Complete pipeline outputs**: Graph, activations, classifications
Navigate to each stage page to explore the example data.
---
## Documentation
- **Complete Guide**: See `eda/README.md` in the Files tab
- **Quick Start**: `QUICK_START_STREAMLIT.md`
- **Main README**: `readme.md`
---
## 🔬 Research Context
This tool is part of research on **automated sparse feature interpretation** using probe prompting techniques.
**Related Work:**
- [Circuit Tracer](https://github.com/safety-research/circuit-tracer) by Anthropic
- [Attribution Graphs](https://transformer-circuits.pub/2025/attribution-graphs/)
- [Neuronpedia](https://www.neuronpedia.org)
---
## 🛠️ Technical Details
**Models Supported:**
- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia
**Resource Usage:**
- RAM: ~2-3GB for typical analyses
- CPU: Modest, since most processing is API-based
- Storage: Outputs saved during session
---
## How to Use
### With Example Data (No API Keys Needed)
1. Navigate through the 3 stage pages in the sidebar
2. Load the Dallas example files provided
3. Explore visualizations and results
### With Your Own Data (API Keys Required)
1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1**: Generate a new graph with your prompt
3. **Stage 2**: Generate concepts and analyze activations
4. **Stage 3**: Classify and name features automatically
---
## 🤝 Contributing
This is a research project on mechanistic interpretability. Feedback and contributions are welcome!
---
## License
GPL-3.0 - See LICENSE file for details
---
**Version**: 2.0.0-clean
**Last Updated**: November 2025
**Deployed on**: Hugging Face Spaces