---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---

# 🔬 Attribution Graph Probing

**Automated Attribution Graph Analysis through Probe Prompting**

Interactive research tool for automated analysis and interpretation of attribution graphs built from Sparse Autoencoder (SAE) and Cross-Layer Transcoder (CLT) features.

---

## 🚀 Quick Start

This Space implements a **3-stage pipeline** for analyzing neural network features:

1. **🌐 Graph Generation**: Generate attribution graphs on Neuronpedia
2. **🔍 Probe Prompts**: Analyze feature activations on semantic concepts  
3. **🔗 Node Grouping**: Automatically classify and name features

### Try the Demo

Click through the sidebar pages to explore the Dallas example dataset included in this Space.

---

## 🔑 API Keys Required

To use this Space with your own data, provide the following keys:

1. **Neuronpedia API Key** (required) - Get it from [neuronpedia.org](https://www.neuronpedia.org)
2. **OpenAI API Key** (optional) - Used for concept generation

Add these as **Secrets** in Space Settings:
- `NEURONPEDIA_API_KEY=your-key-here`
- `OPENAI_API_KEY=your-key-here`

Or enter them directly in the sidebar when using the app.
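
As a rough illustration of how the two key sources could be combined, the sketch below reads a key from the sidebar value if one was typed and otherwise falls back to the Space secret, which Hugging Face exposes to the app as an environment variable. The helper name `resolve_api_key` and the precedence order (sidebar first) are assumptions made for this example, not a description of the app's actual code.

```python
import os
from typing import Optional


def resolve_api_key(name: str, sidebar_value: str = "") -> Optional[str]:
    """Hypothetical helper: prefer a key typed in the sidebar, else the secret."""
    # Assumption: a non-empty sidebar entry overrides the Space secret.
    typed = sidebar_value.strip()
    if typed:
        return typed
    # Space secrets are available to the running app as environment variables.
    return os.environ.get(name) or None


neuronpedia_key = resolve_api_key("NEURONPEDIA_API_KEY")
openai_key = resolve_api_key("OPENAI_API_KEY")  # optional key

if neuronpedia_key is None:
    print("NEURONPEDIA_API_KEY is not set; use the bundled example data instead.")
```

Returning `None` for a missing key lets the caller decide whether to disable a feature or fall back to the example data.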

---

## 📊 Features

### Stage 1: Graph Generation
- Generate attribution graphs via Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
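
To make the static metrics more concrete, here is a minimal sketch of one plausible reading of cumulative influence: nodes are sorted by influence, and each node is assigned the running share of total influence up to and including it. This definition, the function names, and the 0.8 threshold are assumptions for illustration; the app may compute these metrics differently.

```python
from typing import Dict, List, Tuple


def cumulative_influence(
    node_influence: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Return (node_id, influence, cumulative_share) sorted by influence."""
    total = sum(node_influence.values())
    if total <= 0:
        raise ValueError("total influence must be positive")
    ranked = sorted(node_influence.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    running = 0.0
    for node_id, influence in ranked:
        running += influence
        rows.append((node_id, influence, running / total))
    return rows


def select_until(
    rows: List[Tuple[str, float, float]], threshold: float = 0.8
) -> List[str]:
    """Keep the top nodes until their cumulative share reaches the threshold."""
    selected = []
    for node_id, _, share in rows:
        selected.append(node_id)
        if share >= threshold:
            break
    return selected


# Hypothetical node IDs and influence scores.
scores = {"node_a": 5.0, "node_b": 3.0, "node_c": 2.0}
rows = cumulative_influence(scores)
print(select_until(rows, threshold=0.8))
```

A cumulative threshold of this kind is one simple way to pick "relevant" features without hand-tuning a per-node cutoff.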

### Stage 2: Probe Prompts
- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
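
The checkpoint-and-resume behavior can be pictured with the sketch below. It assumes results are stored in a JSON file keyed by concept, and that `measure` is a stand-in for whatever call measures activations for one concept; both the file layout and the function names are hypothetical rather than taken from the app.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def _save(path: Path, results: Dict[str, float]) -> None:
    # Write to a temporary file first so an interruption cannot leave
    # a half-written checkpoint behind.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(results))
    tmp.replace(path)


def run_with_checkpoints(
    concepts: Iterable[str],
    measure: Callable[[str], float],
    checkpoint_path: str,
    every: int = 10,
) -> Dict[str, float]:
    """Measure each concept once, saving progress every `every` items."""
    path = Path(checkpoint_path)
    # Resume: load earlier results if a checkpoint already exists.
    results: Dict[str, float] = json.loads(path.read_text()) if path.exists() else {}
    pending = [c for c in concepts if c not in results]
    for i, concept in enumerate(pending, start=1):
        results[concept] = measure(concept)
        if i % every == 0:
            _save(path, results)
    _save(path, results)  # final save covers the remainder
    return results
```

Calling `run_with_checkpoints` again with the same `checkpoint_path` skips concepts that already have a result, so an interrupted run only repeats the work done since the last save.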

### Stage 3: Node Grouping
- Classify features into 4 categories:
  - **Semantic (Dictionary)**: Specific tokens
  - **Semantic (Concept)**: Related concepts
  - **Say "X"**: Output predictions
  - **Relationship**: Entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization
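
For readers who want to post-process the classifications, the four categories map naturally onto an enumeration. The sketch below only shows a possible data structure for labeled features and a grouping helper; the class names, field names, and example features are hypothetical, and the classification rules themselves are not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class FeatureCategory(Enum):
    SEMANTIC_DICTIONARY = "Semantic (Dictionary)"
    SEMANTIC_CONCEPT = "Semantic (Concept)"
    SAY_X = 'Say "X"'
    RELATIONSHIP = "Relationship"


@dataclass(frozen=True)
class LabeledFeature:
    feature_id: str            # hypothetical identifier
    category: FeatureCategory  # one of the four categories above
    name: str                  # automatically assigned name


def group_by_category(
    features: Iterable[LabeledFeature],
) -> Dict[FeatureCategory, List[str]]:
    """Collect feature names under their category."""
    groups: Dict[FeatureCategory, List[str]] = defaultdict(list)
    for feature in features:
        groups[feature.category].append(feature.name)
    return dict(groups)


# Made-up examples, not taken from the Dallas dataset outputs.
example = [
    LabeledFeature("f1", FeatureCategory.SAY_X, 'Say "Austin"'),
    LabeledFeature("f2", FeatureCategory.SEMANTIC_CONCEPT, "US states"),
    LabeledFeature("f3", FeatureCategory.SAY_X, 'Say "capital"'),
]
print(group_by_category(example)[FeatureCategory.SAY_X])
```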

---

## 📁 Example Dataset

This Space includes the **Dallas example**:
- **Prompt**: "The capital of state containing Dallas is"
- **Target**: "Austin"
- **Features**: 55 features from the Gemma-2-2B model
- **Complete pipeline outputs**: Graph, activations, classifications

Navigate to each stage page to explore the example data.

---

## 📖 Documentation

- **Complete Guide**: See `eda/README.md` in the Files tab
- **Quick Start**: `QUICK_START_STREAMLIT.md`
- **Main README**: `readme.md`

---

## 🔬 Research Context

This tool is part of research on **automated sparse feature interpretation** using probe prompting techniques.

**Related Work:**
- [Circuit Tracer](https://github.com/safety-research/circuit-tracer) by Anthropic
- [Attribution Graphs](https://transformer-circuits.pub/2025/attribution-graphs/)
- [Neuronpedia](https://www.neuronpedia.org)

---

## 🛠️ Technical Details

**Models Supported:**
- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia

**Resource Usage:**
- RAM: ~2-3 GB for typical analyses
- CPU: sufficient on its own, since processing is API-based
- Storage: outputs are saved during the session

---

## 📝 How to Use

### With Example Data (No API Keys Needed)
1. Navigate through the 3 stage pages in the sidebar
2. Load the Dallas example files provided
3. Explore visualizations and results

### With Your Own Data (API Keys Required)
1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1**: Generate a new graph with your prompt
3. **Stage 2**: Generate concepts and analyze activations
4. **Stage 3**: Classify and name features automatically

---

## 🤝 Contributing

This is a research project on mechanistic interpretability. Feedback and contributions are welcome!

---

## 📄 License

GPL-3.0 - See LICENSE file for details

---

**Version**: 2.0.0-clean  
**Last Updated**: November 2025  
**Deployed on**: Hugging Face Spaces