	---
	title: Attribution Graph Probing
	emoji: 🔬
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 7860
	pinned: false
	license: gpl-3.0
	---

	# 🔬 Attribution Graph Probing

	Automated Attribution Graph Analysis through Probe Prompting

	An interactive research tool for the automated analysis and interpretation of attribution graphs built from Sparse Autoencoder (SAE) and Cross-Layer Transcoder (CLT) features.

	---

	## 🚀 Quick Start

	This Space implements a 3-stage pipeline for analyzing neural network features:

	1. 🌐 Graph Generation: Generate attribution graphs on Neuronpedia
	2. 🔍 Probe Prompts: Analyze feature activations on semantic concepts
	3. 🔗 Node Grouping: Automatically classify and name features

	### Try the Demo

	Click through the sidebar pages to explore the Dallas example dataset included in this Space.

	---

	## 🔑 API Keys Required

	To use this Space with your own data, provide the following keys:

	1. Neuronpedia API Key (required): get it from [neuronpedia.org](https://www.neuronpedia.org)
	2. OpenAI API Key (optional): used for concept generation

	Add these as Secrets in Space Settings:
	- `NEURONPEDIA_API_KEY=your-key-here`
	- `OPENAI_API_KEY=your-key-here`

	Or enter them directly in the sidebar when using the app.
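The two ways of supplying a key can be combined in one small lookup. The sketch below is illustrative only and is not taken from the app's source: `resolve_api_key` is a hypothetical helper, and the precedence (a value typed in the sidebar wins over the Space secret) is an assumption. It relies on the fact that Space secrets are exposed to the container as environment variables.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    name: str,
    sidebar_value: str = "",
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the key typed in the sidebar if present, else the Space secret.

    `name` is the secret name, e.g. "NEURONPEDIA_API_KEY".
    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    Returns None when no key is available, so callers can stay in demo mode.
    """
    env = os.environ if env is None else env
    typed = sidebar_value.strip()
    if typed:
        # Assumption: an explicitly typed key overrides the stored secret.
        return typed
    stored = env.get(name, "").strip()
    return stored or None


if __name__ == "__main__":
    key = resolve_api_key("OPENAI_API_KEY")
    print("OpenAI key configured" if key else "No OpenAI key: concept generation is skipped")
```

Returning `None` instead of raising keeps the optional OpenAI key optional: the caller decides whether a missing key is an error.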

	---

	## 📊 Features

	### Stage 1: Graph Generation
	- Generate attribution graphs via Neuronpedia API
	- Extract static metrics (node influence, cumulative influence)
	- Interactive visualizations (layer × context position)
	- Select relevant features for analysis
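As a rough illustration of the static metrics listed above, the sketch below ranks nodes by influence and attaches a cumulative share. It is written under stated assumptions rather than taken from the app: it assumes each node has a non-negative influence score and that "cumulative influence" means the running fraction of total influence once nodes are sorted from most to least influential. The node ids and the `0.8` threshold are hypothetical.

```python
from typing import Dict, List, Tuple


def cumulative_influence(
    node_influence: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Rank nodes by influence and attach the running share of the total.

    Returns (node_id, influence, cumulative_share) rows, highest influence first.
    """
    total = sum(node_influence.values())
    if total <= 0:
        raise ValueError("total influence must be positive")
    ranked = sorted(node_influence.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    running = 0.0
    for node_id, score in ranked:
        running += score
        rows.append((node_id, score, running / total))
    return rows


def select_top(node_influence: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Keep the shortest ranked prefix whose cumulative share reaches `threshold`."""
    selected = []
    for node_id, _, share in cumulative_influence(node_influence):
        selected.append(node_id)
        if share >= threshold:
            break
    return selected


if __name__ == "__main__":
    # Hypothetical node ids and scores, for illustration only.
    scores = {"feature_a": 5.0, "feature_b": 3.0, "feature_c": 2.0}
    for row in cumulative_influence(scores):
        print(row)
    print(select_top(scores, threshold=0.75))
```

Selecting by cumulative share rather than by a fixed count keeps the selection proportional to how concentrated the influence is in a given graph.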

	### Stage 2: Probe Prompts
	- Auto-generate semantic concepts via OpenAI
	- Measure feature activations across concepts
	- Automatic checkpoints for long analyses
	- Resume from interruptions
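The checkpoint-and-resume behaviour can be pictured with a minimal sketch. This is not the app's implementation: `run_with_checkpoints`, the JSON checkpoint format, and the `measure` callback are all hypothetical, chosen only to show the general pattern of saving progress after each item and skipping finished items on restart.

```python
import json
import os
from typing import Callable, Dict, Iterable


def run_with_checkpoints(
    concepts: Iterable[str],
    measure: Callable[[str], float],
    checkpoint_path: str,
) -> Dict[str, float]:
    """Measure each concept once, saving progress after every item.

    If `checkpoint_path` already exists, previously measured concepts are
    loaded from it and skipped, so an interrupted run resumes where it stopped.
    """
    results: Dict[str, float] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as fh:
            results = json.load(fh)

    for concept in concepts:
        if concept in results:
            continue  # already measured in an earlier run
        results[concept] = measure(concept)
        # Write to a temporary file and rename it, so an interruption
        # during the write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
        os.replace(tmp_path, checkpoint_path)
    return results
```

Checkpointing after every item trades a little extra disk I/O for losing at most one measurement when a long analysis is interrupted.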

	### Stage 3: Node Grouping
	- Classify features into 4 categories:
	  - Semantic (Dictionary): Specific tokens
	  - Semantic (Concept): Related concepts
	  - Say "X": Output predictions
	  - Relationship: Entity relationships
	- Automatic naming based on activation patterns
	- Upload to Neuronpedia for visualization
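One way to picture the output of this stage is a small data structure holding the four categories and the named features assigned to them. The sketch below is an assumption about shape only: `FeatureCategory`, `GroupedFeature`, and `group_by_category` are hypothetical names, the feature ids and labels are made up, and the classification rules themselves are not shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class FeatureCategory(Enum):
    """The four categories listed above."""
    SEMANTIC_DICTIONARY = "Semantic (Dictionary)"  # specific tokens
    SEMANTIC_CONCEPT = "Semantic (Concept)"        # related concepts
    SAY_X = 'Say "X"'                              # output predictions
    RELATIONSHIP = "Relationship"                  # entity relationships


@dataclass(frozen=True)
class GroupedFeature:
    feature_id: str            # hypothetical identifier format
    category: FeatureCategory
    name: str                  # human-readable label given to the feature


def group_by_category(
    features: Iterable[GroupedFeature],
) -> Dict[FeatureCategory, List[str]]:
    """Collect feature names under each category, keeping empty categories."""
    groups: Dict[FeatureCategory, List[str]] = {c: [] for c in FeatureCategory}
    for feature in features:
        groups[feature.category].append(feature.name)
    return groups


if __name__ == "__main__":
    # Made-up features, for illustration only.
    demo = [
        GroupedFeature("f1", FeatureCategory.SEMANTIC_DICTIONARY, "Dallas"),
        GroupedFeature("f2", FeatureCategory.SAY_X, 'Say "Austin"'),
    ]
    for category, names in group_by_category(demo).items():
        print(category.value, names)
```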

	---

	## 📁 Example Dataset

	This Space includes the Dallas example:
	- Prompt: "The capital of state containing Dallas is"
	- Target: "Austin"
	- Features: 55 features from Gemma-2-2B model
	- Complete pipeline outputs: Graph, activations, classifications

	Navigate to each stage page to explore the example data.

	---

	## 📖 Documentation

	- Complete Guide: See `eda/README.md` in the Files tab
	- Quick Start: `QUICK_START_STREAMLIT.md`
	- Main README: `readme.md`

	---

	## 🔬 Research Context

	This tool is part of research on automated sparse feature interpretation using probe prompting techniques.

	Related Work:
	- [Circuit Tracer](https://github.com/safety-research/circuit-tracer) by Anthropic
	- [Attribution Graphs](https://transformer-circuits.pub/2025/attribution-graphs/)
	- [Neuronpedia](https://www.neuronpedia.org)

	---

	## 🛠️ Technical Details

	Models Supported:
	- Gemma-2-2B, Gemma-2-9B
	- GPT-2 Small
	- Any model with SAE/CLT features on Neuronpedia

	Resource Usage:
	- RAM: ~2-3 GB for typical analyses
	- CPU: Efficient for API-based processing
	- Storage: Outputs saved during session

	---

	## 📝 How to Use

	### With Example Data (No API Keys Needed)
	1. Navigate through the 3 stage pages in the sidebar
	2. Load the Dallas example files provided
	3. Explore visualizations and results

	### With Your Own Data (API Keys Required)
	1. Add your API keys in Settings → Secrets or in the sidebar
	2. Stage 1: Generate a new graph with your prompt
	3. Stage 2: Generate concepts and analyze activations
	4. Stage 3: Classify and name features automatically

	---

	## 🤝 Contributing

	This is a research project in mechanistic interpretability. Feedback and contributions are welcome!

	---

	## 📄 License

	GPL-3.0 - See LICENSE file for details

	---

	Version: 2.0.0-clean
	Last Updated: November 2025
	Deployed on: Hugging Face Spaces