peppinob-ol
Initial deployment: Attribution Graph Probing app
metadata
title: Attribution Graph Probing
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: gpl-3.0

🔬 Attribution Graph Probing

Automated Attribution Graph Analysis through Probe Prompting

An interactive research tool for automated analysis and interpretation of attribution graphs built from Sparse Autoencoder (SAE) and Cross-Layer Transcoder (CLT) features.


🚀 Quick Start

This Space implements a 3-stage pipeline for analyzing neural network features:

  1. 🌐 Graph Generation: Generate attribution graphs on Neuronpedia
  2. 🔍 Probe Prompts: Analyze feature activations on semantic concepts
  3. 🔗 Node Grouping: Automatically classify and name features

Try the Demo

Click through the sidebar pages to explore the Dallas example dataset included in this Space.


🔑 API Keys Required

To use this Space with your own data, you need a Neuronpedia key; the OpenAI key is needed only for automatic concept generation:

  1. Neuronpedia API Key - Get it from neuronpedia.org
  2. OpenAI API Key - For concept generation (optional)

Add these as Secrets in Space Settings:

  • NEURONPEDIA_API_KEY=your-key-here
  • OPENAI_API_KEY=your-key-here

Or enter them directly in the sidebar when using the app.
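
Hugging Face Spaces exposes Secrets to the running app as environment variables, so the two ways of supplying a key can be combined roughly as sketched below. The helper name `resolve_api_key` and the precedence rule (a value typed in the sidebar wins over the Secret) are assumptions for illustration, not the app's confirmed behavior.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    name: str,
    sidebar_value: str = "",
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the key typed in the sidebar if present, else the Space Secret.

    Hypothetical helper: the precedence order is an assumption.
    """
    env = os.environ if env is None else env
    typed = sidebar_value.strip()
    if typed:
        return typed
    # Secrets set in Space Settings arrive as environment variables.
    stored = env.get(name, "").strip()
    return stored or None


if __name__ == "__main__":
    neuronpedia_key = resolve_api_key("NEURONPEDIA_API_KEY")
    openai_key = resolve_api_key("OPENAI_API_KEY")  # optional key
    print("Neuronpedia key set:", neuronpedia_key is not None)
    print("OpenAI key set:", openai_key is not None)
```

Returning `None` for a missing key lets the caller decide whether to block (Neuronpedia) or simply disable a feature (OpenAI concept generation).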


📊 Features

Stage 1: Graph Generation

  • Generate attribution graphs via Neuronpedia API
  • Extract static metrics (node influence, cumulative influence)
  • Interactive visualizations (layer × context position)
  • Select relevant features for analysis
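
As a rough illustration of how the static metrics can drive feature selection, the sketch below ranks nodes by influence and attaches a running share of the total. The definition of cumulative influence used here (sorted running sum divided by the total), the function names, and the 0.8 threshold are assumptions, not the app's documented computation.

```python
from typing import Dict, List, Tuple


def cumulative_influence(
    node_influence: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Rank nodes by influence and attach the running share of the total."""
    total = sum(node_influence.values())
    if total <= 0:
        raise ValueError("total influence must be positive")
    ranked = sorted(node_influence.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    running = 0.0
    for node_id, influence in ranked:
        running += influence
        rows.append((node_id, influence, running / total))
    return rows


def select_features(
    node_influence: Dict[str, float], threshold: float = 0.8
) -> List[str]:
    """Keep top-ranked nodes until their cumulative share reaches the threshold."""
    selected = []
    for node_id, _, cumulative in cumulative_influence(node_influence):
        selected.append(node_id)
        if cumulative >= threshold:
            break
    return selected
```

For example, `select_features({"n1": 2.0, "n2": 5.0, "n3": 3.0}, threshold=0.75)` keeps the two most influential nodes; the node ids are placeholders.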

Stage 2: Probe Prompts

  • Auto-generate semantic concepts via OpenAI
  • Measure feature activations across concepts
  • Automatic checkpoints for long analyses
  • Resume from interruptions
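
The checkpoint-and-resume behavior can be pictured with the minimal sketch below. It assumes a JSON checkpoint file keyed by concept and a caller-supplied `measure` function; both, along with the name `run_with_checkpoints`, are hypothetical and do not describe the app's actual file format.

```python
import json
import os
from typing import Callable, Dict, Iterable


def run_with_checkpoints(
    concepts: Iterable[str],
    measure: Callable[[str], Dict[str, float]],
    checkpoint_path: str,
) -> Dict[str, Dict[str, float]]:
    """Measure activations per concept, saving after each one so a rerun resumes."""
    results: Dict[str, Dict[str, float]] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as fh:
            results = json.load(fh)
    for concept in concepts:
        if concept in results:
            continue  # already measured in a previous run
        results[concept] = measure(concept)
        # Write to a temporary file, then rename, so an interruption
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
        os.replace(tmp_path, checkpoint_path)
    return results
```

Calling the function again with the same `checkpoint_path` skips every concept already stored, so only the remaining concepts trigger new measurements.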

Stage 3: Node Grouping

  • Classify features into 4 categories:
    • Semantic (Dictionary): Specific tokens
    • Semantic (Concept): Related concepts
    • Say "X": Output predictions
    • Relationship: Entity relationships
  • Automatic naming based on activation patterns
  • Upload to Neuronpedia for visualization
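
To make the four categories concrete, the sketch below models them as an enum and pairs them with a toy naming rule. The rule (label a feature with its strongest-activating probe concept) and the names `FeatureCategory` and `name_feature` are illustrative assumptions; the app's real classification and naming logic may differ.

```python
from enum import Enum
from typing import Dict


class FeatureCategory(Enum):
    SEMANTIC_DICTIONARY = "Semantic (Dictionary)"
    SEMANTIC_CONCEPT = "Semantic (Concept)"
    SAY_X = 'Say "X"'
    RELATIONSHIP = "Relationship"


def name_feature(category: FeatureCategory, activations: Dict[str, float]) -> str:
    """Hypothetical naming rule: use the strongest-activating probe concept."""
    if not activations:
        raise ValueError("no activations to name the feature from")
    top = max(activations, key=activations.get)
    if category is FeatureCategory.SAY_X:
        # Output-prediction features are named after the token they promote.
        return f'Say "{top}"'
    return f"{category.value}: {top}"
```

With this rule, a Say "X" feature whose activations peak on the concept Austin would be named `Say "Austin"`.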

📁 Example Dataset

This Space includes the Dallas example:

  • Prompt: "The capital of state containing Dallas is"
  • Target: "Austin"
  • Features: 55 features from the Gemma-2-2B model
  • Complete pipeline outputs: Graph, activations, classifications

Navigate to each stage page to explore the example data.


📖 Documentation

  • Complete Guide: See eda/README.md in the Files tab
  • Quick Start: QUICK_START_STREAMLIT.md
  • Main README: readme.md

🔬 Research Context

This tool is part of research on automated sparse feature interpretation using probe prompting techniques.

Related Work:


🛠️ Technical Details

Models Supported:

  • Gemma-2-2B, Gemma-2-9B
  • GPT-2 Small
  • Any model with SAE/CLT features on Neuronpedia

Resource Usage:

  • RAM: ~2-3GB for typical analyses
  • CPU: Efficient for API-based processing
  • Storage: Outputs saved during session

📝 How to Use

With Example Data (No API Keys Needed)

  1. Navigate through the 3 stage pages in the sidebar
  2. Load the Dallas example files provided
  3. Explore visualizations and results

With Your Own Data (API Keys Required)

  1. Add your API keys in Settings → Secrets or in the sidebar
  2. Stage 1: Generate a new graph with your prompt
  3. Stage 2: Generate concepts and analyze activations
  4. Stage 3: Classify and name features automatically

🀝 Contributing

This is a research project on mechanistic interpretability. Feedback and contributions are welcome!


📄 License

GPL-3.0 - See LICENSE file for details


Version: 2.0.0-clean
Last Updated: November 2025
Deployed on: Hugging Face Spaces