---
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0
---
# 🔬 Attribution Graph Probing

**Automated Attribution Graph Analysis through Probe Prompting**

An interactive research tool for automated analysis and interpretation of attribution graphs from Sparse Autoencoders (SAEs) and Cross-Layer Transcoders (CLTs).
## Quick Start

This Space implements a three-stage pipeline for analyzing neural network features:

1. **Graph Generation** - generate attribution graphs on Neuronpedia
2. **Probe Prompts** - analyze feature activations on semantic concepts
3. **Node Grouping** - automatically classify and name features
### Try the Demo

Click through the sidebar pages to explore the Dallas example dataset included in this Space.
## API Keys Required

To use this Space with your own data, you need:

- **Neuronpedia API Key** - get it from [neuronpedia.org](https://neuronpedia.org)
- **OpenAI API Key** - for concept generation (optional)

Add these as Secrets in your Space Settings:
```
NEURONPEDIA_API_KEY=your-key-here
OPENAI_API_KEY=your-key-here
```
Or enter them directly in the sidebar when using the app.
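If you run the app locally instead (this Space uses the Docker SDK on port 7860), the same keys can be passed as environment variables. The image name below is only a placeholder, not something the repository defines:

```shell
# Build the Space image locally (image name is illustrative)
docker build -t attribution-graph-probing .

# Run on the same port the Space exposes, passing the API keys as env vars
docker run -p 7860:7860 \
  -e NEURONPEDIA_API_KEY="your-key-here" \
  -e OPENAI_API_KEY="your-key-here" \
  attribution-graph-probing
```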
## Features

### Stage 1: Graph Generation

- Generate attribution graphs via the Neuronpedia API
- Extract static metrics (node influence, cumulative influence)
- Interactive visualizations (layer × context position)
- Select relevant features for analysis
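To illustrate the static metrics above, here is a minimal sketch of how per-node influence scores could drive a cumulative-influence feature selection. The field names (`influence`, `id`) and the 80% threshold are assumptions for illustration, not the tool's actual implementation:

```python
def select_by_cumulative_influence(nodes, threshold=0.8):
    """Keep the most influential nodes until they jointly account for
    `threshold` of total influence (hypothetical helper)."""
    ranked = sorted(nodes, key=lambda n: n["influence"], reverse=True)
    total = sum(n["influence"] for n in ranked)
    selected, cumulative = [], 0.0
    for node in ranked:
        selected.append(node)
        cumulative += node["influence"] / total
        if cumulative >= threshold:
            break
    return selected

# Toy graph nodes with made-up ids and influence scores
nodes = [
    {"id": "L12/f_301", "influence": 0.50},
    {"id": "L8/f_17",   "influence": 0.30},
    {"id": "L3/f_942",  "influence": 0.15},
    {"id": "L1/f_5",    "influence": 0.05},
]
top = select_by_cumulative_influence(nodes)  # first two nodes cover 80%
```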
### Stage 2: Probe Prompts

- Auto-generate semantic concepts via OpenAI
- Measure feature activations across concepts
- Automatic checkpoints for long analyses
- Resume from interruptions
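The checkpoint/resume behavior can be sketched roughly as below. This is an illustrative pattern, not the tool's actual code; `measure_activations` is a hypothetical stand-in for the per-concept activation measurement:

```python
import json
from pathlib import Path

def run_with_checkpoints(concepts, measure_activations,
                         checkpoint_path="probe_checkpoint.json"):
    """Process concepts one by one, saving results after each so an
    interrupted run can resume where it left off (illustrative sketch)."""
    path = Path(checkpoint_path)
    results = json.loads(path.read_text()) if path.exists() else {}
    for concept in concepts:
        if concept in results:      # already done in a previous run
            continue
        results[concept] = measure_activations(concept)
        path.write_text(json.dumps(results))  # checkpoint after each concept
    return results
```

On a re-run with the same checkpoint file, completed concepts are skipped and only the remaining ones are processed.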
### Stage 3: Node Grouping

- Classify features into four categories:
  - **Semantic (Dictionary)**: specific tokens
  - **Semantic (Concept)**: related concepts
  - **Say "X"**: output predictions
  - **Relationship**: entity relationships
- Automatic naming based on activation patterns
- Upload to Neuronpedia for visualization
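As a toy sketch of what rule-based classification over activation patterns could look like, consider the heuristic below. The profile fields, thresholds, and rules are invented for illustration and do not reflect the tool's actual classifier:

```python
def classify_feature(profile):
    """Assign one of the four categories from a feature's activation
    profile (toy heuristic; fields and thresholds are assumptions)."""
    if profile["predicts_output_token"]:
        return 'Say "X"'
    if profile["relation_score"] > 0.5:
        return "Relationship"
    # Dictionary-like features fire on a handful of exact tokens;
    # concept-like features fire broadly across related prompts.
    if profile["n_activating_tokens"] <= 3:
        return "Semantic (Dictionary)"
    return "Semantic (Concept)"

example = {"predicts_output_token": False, "relation_score": 0.1,
           "n_activating_tokens": 2}
classify_feature(example)  # "Semantic (Dictionary)"
```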
## Example Dataset

This Space includes the Dallas example:

- **Prompt:** "The capital of state containing Dallas is"
- **Target:** "Austin"
- **Features:** 55 features from the Gemma-2-2B model
- **Complete pipeline outputs:** graph, activations, classifications

Navigate to each stage page to explore the example data.
## Documentation

- **Complete Guide:** see `eda/README.md` in the Files tab
- **Quick Start:** `QUICK_START_STREAMLIT.md`
- **Main README:** `readme.md`
## Research Context

This tool is part of research on automated sparse feature interpretation using probe prompting techniques.

Related work:

- Circuit Tracer by Anthropic
- Attribution Graphs
- Neuronpedia
## Technical Details

**Models supported:**

- Gemma-2-2B, Gemma-2-9B
- GPT-2 Small
- Any model with SAE/CLT features on Neuronpedia

**Resource usage:**

- RAM: ~2-3 GB for typical analyses
- CPU: light, since processing is API-based
- Storage: outputs are saved for the duration of the session
## How to Use

### With Example Data (No API Keys Needed)

1. Navigate through the three stage pages in the sidebar
2. Load the provided Dallas example files
3. Explore the visualizations and results

### With Your Own Data (API Keys Required)

1. Add your API keys in Settings → Secrets or in the sidebar
2. **Stage 1:** generate a new graph with your prompt
3. **Stage 2:** generate concepts and analyze activations
4. **Stage 3:** classify and name features automatically
## Contributing

This is a research project in mechanistic interpretability. Feedback and contributions are welcome!
## License

GPL-3.0 - see the LICENSE file for details.

---

**Version:** 2.0.0-clean ·
**Last updated:** November 2025 ·
**Deployed on:** Hugging Face Spaces