---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---
# 🔬 Attribution Graph Probing

**Automated Attribution Graph Analysis through Probe Prompting**

Interactive research tool for automated analysis and interpretation of attribution graphs from Sparse Autoencoders (SAE) and Cross-Layer Transcoders (CLT).

---
## Quick Start

This Space implements a **3-stage pipeline** for analyzing neural network features:

1. **Graph Generation**: Generate attribution graphs on Neuronpedia
2. **Probe Prompts**: Analyze feature activations on semantic concepts
3. **Node Grouping**: Automatically classify and name features

### Try the Demo

Click through the sidebar pages to explore the Dallas example dataset included in this Space.

---
## API Keys Required

To use this Space with your own data, you need:

1. **Neuronpedia API Key** - Get it from [neuronpedia.org](https://www.neuronpedia.org)
2. **OpenAI API Key** - For concept generation (optional)

Add these as **Secrets** in Space Settings:

- `NEURONPEDIA_API_KEY=your-key-here`
- `OPENAI_API_KEY=your-key-here`

Or enter them directly in the sidebar when using the app.
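
Space secrets are exposed to the running container as environment variables, so the keys can be read at startup, with the sidebar fields as a fallback. The following is a minimal sketch under that assumption; `load_api_keys` is a hypothetical helper for illustration, not a function from this project.

```python
import os


def load_api_keys(env=None):
    """Return both keys from an environment mapping; missing or blank keys become None."""
    # `env` defaults to the process environment; passing a dict makes the helper easy to test.
    env = os.environ if env is None else env
    keys = {}
    for name in ("NEURONPEDIA_API_KEY", "OPENAI_API_KEY"):
        value = (env.get(name) or "").strip()
        keys[name] = value or None
    return keys


if __name__ == "__main__":
    # Report which keys are available without printing their values.
    for name, value in load_api_keys().items():
        print(f"{name}: {'set' if value else 'missing'}")
```

A key that comes back as `None` is the case where the app would need to ask for it in the sidebar (or, for the optional OpenAI key, skip concept generation).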

---

## Features

### Stage 1: Graph Generation

- Generate attribution graphs via Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
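
To illustrate how a cumulative influence metric can drive feature selection, here is a small sketch. It assumes one plausible definition (rank nodes by influence, then take the running share of the total); this README does not spell out the exact formula the tool uses, and the function names and the `threshold` parameter are hypothetical.

```python
def cumulative_influence(influences):
    """Map node id -> cumulative share of total influence, ranked from most to least influential."""
    total = sum(influences.values())
    if total <= 0:
        raise ValueError("total influence must be positive")
    ranked = sorted(influences.items(), key=lambda item: item[1], reverse=True)
    running = 0.0
    result = {}  # insertion order follows the ranking
    for node_id, value in ranked:
        running += value
        result[node_id] = running / total
    return result


def select_top(influences, threshold=0.8):
    """Keep the smallest top-ranked set of nodes whose cumulative share reaches the threshold."""
    selected = []
    for node_id, share in cumulative_influence(influences).items():
        selected.append(node_id)
        if share >= threshold:
            break
    return selected
```

Under this assumed definition, a lower threshold keeps fewer, more influential nodes, which is one way to narrow a large graph down to the features worth probing.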

### Stage 2: Probe Prompts

- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
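
The checkpoint-and-resume behavior can be sketched as a loop that saves results after every concept and skips concepts already present in the saved file. This is an illustrative pattern, not the project's actual implementation; `run_with_checkpoints`, the `measure` callback, and the JSON checkpoint format are all assumptions.

```python
import json
import os
import tempfile


def run_with_checkpoints(concepts, measure, checkpoint_path):
    """Call measure(concept) for each concept, saving all results after every step."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            results = json.load(f)  # resume from an earlier, interrupted run
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    for concept in concepts:
        if concept in results:
            continue  # already measured before the interruption
        results[concept] = measure(concept)
        # Write to a temporary file, then rename it over the checkpoint,
        # so a crash mid-write cannot leave a half-written file behind.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results
```

Calling the function again with the same `checkpoint_path` only invokes `measure` for concepts that are not yet stored, which is what keeps a long analysis from repeating API calls after an interruption.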

### Stage 3: Node Grouping

- Classify features into 4 categories:
  - **Semantic (Dictionary)**: Specific tokens
  - **Semantic (Concept)**: Related concepts
  - **Say "X"**: Output predictions
  - **Relationship**: Entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization
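
One way to represent the four categories in code is an enumeration plus a grouping helper, sketched below. The sketch covers only the data structure: the classification rules themselves are not described in this README, and `FeatureCategory`, `group_by_category`, and the feature ids in the usage example are hypothetical names.

```python
from collections import defaultdict
from enum import Enum


class FeatureCategory(Enum):
    """The four categories listed above, keyed by their display labels."""
    SEMANTIC_DICTIONARY = "Semantic (Dictionary)"
    SEMANTIC_CONCEPT = "Semantic (Concept)"
    SAY_X = 'Say "X"'
    RELATIONSHIP = "Relationship"


def group_by_category(labels):
    """Group feature ids by category; an unknown label string raises ValueError."""
    groups = defaultdict(list)
    for feature_id, label in labels.items():
        groups[FeatureCategory(label)].append(feature_id)
    return dict(groups)
```

For example, `group_by_category({"f1": "Relationship", "f2": 'Say "X"'})` returns a dictionary with one list of feature ids per category that actually occurs, and a misspelled label fails loudly instead of creating a fifth group.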

---

## Example Dataset

This Space includes the **Dallas example**:

- **Prompt**: "The capital of state containing Dallas is"
- **Target**: "Austin"
- **Features**: 55 features from the Gemma-2-2B model
- **Complete pipeline outputs**: Graph, activations, classifications

Navigate to each stage page to explore the example data.

---
## Documentation

- **Complete Guide**: See `eda/README.md` in the Files tab
- **Quick Start**: `QUICK_START_STREAMLIT.md`
- **Main README**: `readme.md`

---
## 🔬 Research Context

This tool is part of research on **automated sparse feature interpretation** using probe prompting techniques.

**Related Work:**

- [Circuit Tracer](https://github.com/safety-research/circuit-tracer) by Anthropic
- [Attribution Graphs](https://transformer-circuits.pub/2025/attribution-graphs/)
- [Neuronpedia](https://www.neuronpedia.org)

---
## 🛠️ Technical Details

**Models Supported:**

- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia

**Resource Usage:**

- RAM: ~2-3GB for typical analyses
- CPU: Efficient for API-based processing
- Storage: Outputs saved during session

---
## How to Use

### With Example Data (No API Keys Needed)

1. Navigate through the 3 stage pages in the sidebar
2. Load the Dallas example files provided
3. Explore visualizations and results

### With Your Own Data (API Keys Required)

1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1**: Generate a new graph with your prompt
3. **Stage 2**: Generate concepts and analyze activations
4. **Stage 3**: Classify and name features automatically

---
## 🤝 Contributing

This is a research project for mechanistic interpretability. Feedback and contributions are welcome!

---

## License

GPL-3.0 - See LICENSE file for details

---

**Version**: 2.0.0-clean
**Last Updated**: November 2025
**Deployed on**: Hugging Face Spaces