---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---

# 🔬 Attribution Graph Probing

**Automated Attribution Graph Analysis through Probe Prompting**

Interactive research tool for automated analysis and interpretation of attribution graphs from Sparse Autoencoders (SAE) and Cross-Layer Transcoders (CLT).

---

## 🚀 Quick Start

This Space implements a **3-stage pipeline** for analyzing neural network features:

1. **🌐 Graph Generation**: Generate attribution graphs on Neuronpedia
2. **πŸ” Probe Prompts**: Analyze feature activations on semantic concepts  
3. **πŸ”— Node Grouping**: Automatically classify and name features

### Try the Demo

Click through the sidebar pages to explore the Dallas example dataset included in this Space.

---

## 🔑 API Keys Required

To use this Space with your own data, you need:

1. **Neuronpedia API Key** - Get it from [neuronpedia.org](https://www.neuronpedia.org)
2. **OpenAI API Key** - For concept generation (optional)

Add these as **Secrets** in Space Settings:
- `NEURONPEDIA_API_KEY=your-key-here`
- `OPENAI_API_KEY=your-key-here`

Or enter them directly in the sidebar when using the app.
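
The two ways of supplying keys described above can be sketched as follows. This is a minimal illustration, assuming Space secrets reach the container as environment variables and that a value typed into the sidebar takes precedence; the `resolve_api_key` helper is hypothetical and not taken from this Space's code.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    name: str,
    sidebar_value: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the sidebar value if given, else the environment variable, else None.

    Hypothetical helper: the precedence order is an assumption for illustration.
    """
    candidate = (sidebar_value or "").strip() or env.get(name, "").strip()
    return candidate or None


if __name__ == "__main__":
    neuronpedia_key = resolve_api_key("NEURONPEDIA_API_KEY")
    openai_key = resolve_api_key("OPENAI_API_KEY")  # optional per the list above
    print("Neuronpedia key set:", neuronpedia_key is not None)
    print("OpenAI key set:", openai_key is not None)
```

Returning `None` for a missing key lets the caller decide whether to fall back to the bundled example data or to prompt for a key.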

---

## 📊 Features

### Stage 1: Graph Generation
- Generate attribution graphs via Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
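
One way to read the influence metrics above is sketched here, under the assumption that cumulative influence is the running share of total influence once nodes are sorted from most to least influential. The function names, node IDs, and values are hypothetical, and the Space's actual definition may differ.

```python
from typing import Dict, List, Tuple


def cumulative_influence(node_influence: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank nodes by influence and return (node_id, cumulative share of total)."""
    total = sum(node_influence.values())
    if total <= 0:
        raise ValueError("total influence must be positive")
    ranked = sorted(node_influence.items(), key=lambda kv: kv[1], reverse=True)
    running = 0.0
    shares = []
    for node_id, influence in ranked:
        running += influence
        shares.append((node_id, running / total))
    return shares


def select_top_features(node_influence: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Keep the shortest ranked prefix whose cumulative share reaches the threshold."""
    selected = []
    for node_id, share in cumulative_influence(node_influence):
        selected.append(node_id)
        if share >= threshold:
            break
    return selected


if __name__ == "__main__":
    # Made-up influence scores for three hypothetical nodes.
    example = {"node_a": 5.0, "node_b": 3.0, "node_c": 2.0}
    print(cumulative_influence(example))
    print(select_top_features(example, threshold=0.75))
```

A threshold of this kind is one simple way to pick "relevant" features: it keeps the few nodes that account for most of the graph's influence and drops the long tail.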

### Stage 2: Probe Prompts
- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
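
The checkpoint-and-resume behaviour listed above could look roughly like the sketch below. It assumes results are stored as a JSON file keyed by concept and rewritten after each measurement; `run_with_checkpoints` and the `measure` callback are hypothetical names, and the Space's own checkpoint format is not shown here.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable


def run_with_checkpoints(
    concepts: Iterable[str],
    measure: Callable[[str], float],
    checkpoint_path: str,
) -> Dict[str, float]:
    """Measure each concept once, saving progress so an interrupted run can resume."""
    results: Dict[str, float] = {}
    if os.path.exists(checkpoint_path):
        # Resume: reload whatever a previous run already finished.
        with open(checkpoint_path, "r", encoding="utf-8") as fh:
            results = json.load(fh)

    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    for concept in concepts:
        if concept in results:
            continue  # already measured in an earlier run
        results[concept] = measure(concept)
        # Write to a temporary file, then swap it in, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
        os.replace(tmp_path, checkpoint_path)
    return results


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "activations_checkpoint.json")
    # Stand-in measurement: a real run would query feature activations instead.
    print(run_with_checkpoints(["Texas", "Austin"], lambda c: float(len(c)), path))
```

Calling the function again with the same path skips concepts that are already in the checkpoint, so only unfinished work is repeated.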

### Stage 3: Node Grouping
- Classify features into 4 categories:
  - **Semantic (Dictionary)**: Specific tokens
  - **Semantic (Concept)**: Related concepts
  - **Say "X"**: Output predictions
  - **Relationship**: Entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization

---

## 📁 Example Dataset

This Space includes the **Dallas example**:
- **Prompt**: "The capital of state containing Dallas is"
- **Target**: "Austin"
- **Features**: 55 features from the Gemma-2-2B model
- **Complete pipeline outputs**: Graph, activations, classifications

Navigate to each stage page to explore the example data.

---

## 📖 Documentation

- **Complete Guide**: See `eda/README.md` in the Files tab
- **Quick Start**: `QUICK_START_STREAMLIT.md`
- **Main README**: `readme.md`

---

## 🔬 Research Context

This tool is part of research on **automated sparse feature interpretation** using probe prompting techniques.

**Related Work:**
- [Circuit Tracer](https://github.com/safety-research/circuit-tracer) by Anthropic
- [Attribution Graphs](https://transformer-circuits.pub/2025/attribution-graphs/)
- [Neuronpedia](https://www.neuronpedia.org)

---

## 🛠️ Technical Details

**Models Supported:**
- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia

**Resource Usage:**
- RAM: ~2-3GB for typical analyses
- CPU: Efficient for API-based processing
- Storage: Outputs saved during session

---

## 📝 How to Use

### With Example Data (No API Keys Needed)
1. Navigate through the 3 stage pages in the sidebar
2. Load the Dallas example files provided
3. Explore visualizations and results

### With Your Own Data (API Keys Required)
1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1**: Generate a new graph with your prompt
3. **Stage 2**: Generate concepts and analyze activations
4. **Stage 3**: Classify and name features automatically

---

## 🀝 Contributing

This is a research project on mechanistic interpretability. Feedback and contributions are welcome!

---

## 📄 License

GPL-3.0 - See LICENSE file for details

---

**Version**: 2.0.0-clean  
**Last Updated**: November 2025  
**Deployed on**: Hugging Face Spaces