Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

arXiv cs.CL / 4/21/2026

📰 News · Models & Research

Key Points

  • The paper introduces PIE, a cross-layer-transcoder-native end-to-end framework (Pruning → automatic Interpretation → evaluation) to measure how feature pruning affects behavioral fidelity and interpretability.
  • It proposes Feature Attribution Patching (FAP) to rank CLT features by aggregating gradient-weighted write contributions, plus FAP-Synergy to rerank features while accounting for synergy effects.
  • Experiments on IOI and Doc-String across multiple feature budgets show that the “FAP family” consistently achieves the best or near-best KL-divergence behavior retention versus several other pruning baselines.
  • The authors report that, for IOI on CLTs built for Llama-3.2-1B and Gemma-2-2B, pruning to K=100 features can match the KL fidelity that random selection would need ~4k features to reach, enabling roughly 40× fewer interpretation/evaluation calls.
  • Interpretation quality is assessed using FADE-style metrics, with FAP-Synergy delivering the most noticeable improvements in stricter (lower-budget) regimes.
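The gradient-weighted scoring in the second bullet can be illustrated with a minimal attribution-patching-style sketch. This is a hypothetical toy, not the paper's implementation: function names, the zero-ablation baseline, and summing over token positions as the aggregation rule are all assumptions.

```python
import numpy as np

def fap_scores(acts, grads):
    # First-order (attribution-patching) estimate of each feature's effect:
    # ablating feature i shifts the target metric by roughly grad_i * (0 - act_i).
    # acts, grads: (n_tokens, n_features) feature activations and metric gradients.
    contrib = grads * (0.0 - acts)
    # Aggregate gradient-weighted write contributions over token positions.
    return np.abs(contrib).sum(axis=0)

def prune_topk(scores, k):
    # Keep the indices of the k highest-scoring features (the pruned circuit).
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 32))   # toy activations: 8 tokens, 32 CLT features
grads = rng.normal(size=(8, 32))  # toy gradients of the behavior metric
keep = prune_topk(fap_scores(acts, grads), k=5)
```

A synergy-aware reranking step like FAP-Synergy would then adjust this ranking for interaction effects between features, which a purely per-feature first-order score cannot capture.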

Abstract

Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior, and the remainder incur substantial explanation and evaluation costs. We introduce PIE, the first CLT-native end-to-end framework connecting Pruning, automatic Interpretation, and interpretation Evaluation, enabling systematic measurement of behavioral fidelity and downstream interpretability under pruning. To achieve this, we propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions, and FAP-Synergy, a synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics. Across IOI and Doc-String, across budgets K \in \{50, 100, 200, 400, 800\}, and across FAP, FAP-Synergy, Activation-Magnitude, and ACDC-style pruning, the FAP family consistently achieves the best or near-best fidelity, with FAP-Synergy providing its clearest gains in strict-budget regimes. On IOI with CLTs for Llama-3.2-1B and Gemma-2-2B, pruning to K=100 features matches the KL fidelity that random selection from the active feature set requires \approx 4k features to achieve (\approx 40\times compression), enabling \approx 40\times fewer interpretation/evaluation calls while substantially reducing the share of low-quality features.
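The fidelity criterion above can be made concrete with a small sketch of KL-divergence behavior retention: compare the full model's output distribution against a pruned circuit's. The distributions below are invented for illustration and do not come from the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how far the pruned circuit's output distribution q
    # drifts from the full model's distribution p (0 means identical).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token probability distributions over three candidates.
full_model  = np.array([0.70, 0.20, 0.10])
pruned_fap  = np.array([0.66, 0.23, 0.11])  # illustrative K=100 attribution-pruned circuit
pruned_rand = np.array([0.40, 0.35, 0.25])  # illustrative small random selection

kl_fap = kl_divergence(full_model, pruned_fap)
kl_rand = kl_divergence(full_model, pruned_rand)
```

Under this metric, a lower KL means better behavior retention; the paper's \approx 40\times compression claim is that a K=100 attribution-based selection matches the KL that random selection only reaches at \approx 4k features.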