Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

arXiv cs.CL / 4/21/2026

📰 News · Models & Research

Key Points

  • The paper introduces PIE, a cross-layer-transcoder-native end-to-end framework (Pruning → automatic Interpretation → evaluation) to measure how feature pruning affects behavioral fidelity and interpretability.
  • It proposes Feature Attribution Patching (FAP) to rank CLT features by aggregating gradient-weighted write contributions, plus FAP-Synergy to rerank features while accounting for synergy effects.
  • Experiments on IOI and Doc-String across multiple feature budgets show that the “FAP family” consistently achieves the best or near-best KL-divergence behavior retention versus several other pruning baselines.
  • The authors report that, for IOI on CLTs built for Llama-3.2-1B and Gemma-2-2B, pruning to K=100 features can match the KL fidelity that random selection would need ~4k features to reach, enabling roughly 40× fewer interpretation/evaluation calls.
  • Interpretation quality is assessed using FADE-style metrics, with FAP-Synergy delivering the most noticeable improvements in stricter (lower-budget) regimes.
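The gradient-weighted scoring in the second bullet can be illustrated with a minimal attribution-patching-style sketch. This is a hypothetical toy, not the paper's implementation: function names, the zero-ablation baseline, and summing over token positions as the aggregation rule are all assumptions.

```python
import numpy as np

def fap_scores(acts, grads):
    # First-order (attribution-patching) estimate of each feature's effect:
    # ablating feature i shifts the target metric by roughly grad_i * (0 - act_i).
    # acts, grads: (n_tokens, n_features) feature activations and metric gradients.
    contrib = grads * (0.0 - acts)
    # Aggregate gradient-weighted write contributions over token positions.
    return np.abs(contrib).sum(axis=0)

def prune_topk(scores, k):
    # Keep the indices of the k highest-scoring features (the pruned circuit).
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 32))   # toy activations: 8 tokens, 32 CLT features
grads = rng.normal(size=(8, 32))  # toy gradients of the behavior metric
keep = prune_topk(fap_scores(acts, grads), k=5)
```

A synergy-aware reranking step like FAP-Synergy would then adjust this ranking for interaction effects between features, which a purely per-feature first-order score cannot capture.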

Abstract

Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior, and the remainder incur substantial explanation and evaluation costs. We introduce PIE, the first CLT-native end-to-end framework connecting Pruning, automatic Interpretation, and interpretation Evaluation, enabling systematic measurement of behavioral fidelity and downstream interpretability under pruning. To achieve this, we propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions, and FAP-Synergy, a synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics. Across IOI and Doc-String, across budgets K \in \{50, 100, 200, 400, 800\}, and across FAP, FAP-Synergy, Activation-Magnitude, and ACDC-style pruning, the FAP family consistently achieves the best or near-best fidelity, with FAP-Synergy providing its clearest gains in strict-budget regimes. On IOI with CLTs for Llama-3.2-1B and Gemma-2-2B, pruning to K=100 features matches the KL fidelity that random selection from the active feature set requires \approx 4k features to achieve (\approx 40\times compression), enabling \approx 40\times fewer interpretation/evaluation calls while substantially reducing the share of low-quality features.
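The fidelity criterion above can be made concrete with a small sketch of KL-divergence behavior retention: compare the full model's output distribution against a pruned circuit's. The distributions below are invented for illustration and do not come from the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how far the pruned circuit's output distribution q
    # drifts from the full model's distribution p (0 means identical).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token probability distributions over three candidates.
full_model  = np.array([0.70, 0.20, 0.10])
pruned_fap  = np.array([0.66, 0.23, 0.11])  # illustrative K=100 attribution-pruned circuit
pruned_rand = np.array([0.40, 0.35, 0.25])  # illustrative small random selection

kl_fap = kl_divergence(full_model, pruned_fap)
kl_rand = kl_divergence(full_model, pruned_rand)
```

Under this metric, a lower KL means better behavior retention; the paper's \approx 40\times compression claim is that a K=100 attribution-based selection matches the KL that random selection only reaches at \approx 4k features.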