AI Navigate

Engineering Verifiable Modularity in Transformers via Per-Layer Supervision

arXiv cs.AI / 3/20/2026


Key Points

  • Transformers exhibit distributed redundancy, so ablation of a single attention head yields minimal behavioral change, making interpretability challenging.
  • The authors propose an architectural approach using dual-stream processing, per-layer supervision, and gated attention regularization to reveal modularity in the model.
  • When trained with per-layer supervision, ablation effects are 5–23x larger than comparably trained controls, enabling 4x greater control leverage over targeted behaviors.
  • Without per-layer supervision, ablation damage stays near zero with low variance; with per-layer supervision, the effects spread widely, signaling unmasked modular circuits and revealing which predictions depend on which circuits.
  • The approach is validated through three components: engineered features that capture computational dynamics rather than vocabulary structure, an architecture that serves as a positive control for modularity, and causal experiments demonstrating functional reorganization in which different tasks route through different attention heads, turning interpretability from passive observation into active control.
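
The core intervention in the key points above — ablating or scaling an identified attention head and measuring the resulting damage — can be sketched on a toy multi-head attention layer. This is a minimal illustration of the measurement, not the paper's implementation; the weight shapes and the normalized-damage metric are assumptions for the sketch.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, ablate=(), scale=None):
    """Toy multi-head self-attention. Heads listed in `ablate` are zeroed
    (hard ablation); `scale` maps head index -> multiplier for graded
    interventions like those used to test control leverage."""
    n_heads, _, d_head = Wq.shape
    outputs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(d_head)
        # Numerically stable softmax over key positions.
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        out = attn @ v
        if h in ablate:
            out = np.zeros_like(out)
        if scale is not None and h in scale:
            out = scale[h] * out
        outputs.append(out)
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 4, 8, 2, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))

baseline = multi_head_attention(x, Wq, Wk, Wv)
ablated = multi_head_attention(x, Wq, Wk, Wv, ablate={0})
halved = multi_head_attention(x, Wq, Wk, Wv, scale={0: 0.5})

# "Ablation damage" here: normalized change in the layer's output.
damage = np.linalg.norm(baseline - ablated) / np.linalg.norm(baseline)
```

In a real model the damage would instead be measured on task behavior (e.g. accuracy drop); the paper's claim is that this quantity stays near zero under standard training and becomes large and head-specific under per-layer supervision.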

Abstract

Transformers resist surgical control. Ablating an attention head identified as critical for capitalization produces minimal behavioral change because distributed redundancy compensates for damage. This Hydra effect renders interpretability illusory: we may identify components through correlation, but cannot predict or control their causal role. We demonstrate that architectural interventions can expose hidden modularity. Our approach combines dual-stream processing separating token and contextual representations, per-layer supervision providing independent gradient signal at each depth, and gated attention regularizing toward discrete activation patterns. When trained with per-layer supervision, models produce ablation effects 5 to 23 times larger than architecturally identical controls trained with standard objectives. This enables 4 times greater control leverage on targeted behaviors: scaling identified attention heads produces smooth, predictable changes in model output. The key finding is architectural. Without per-layer supervision, ablation damage concentrates near zero with low variance (Winograd standard deviation 0.63%). With per-layer supervision, effects spread widely (standard deviation 6.32%), revealing which predictions depend on which circuits. The larger variance is not measurement noise but the signature of unmasked modularity. We validate our approach through three components: engineered features that capture computational dynamics rather than vocabulary structure (validated by near-zero correlation with raw activation clustering), an architecture providing positive control for modularity, and causal experiments demonstrating functional reorganization where different tasks route through different attention heads. This establishes a methodology for transforming interpretability from passive observation to active control.
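The "independent gradient signal at each depth" from the abstract can be sketched as a training objective in which every layer's hidden state is decoded by its own readout and receives its own loss term, rather than supervising only the final layer. The readout matrices and the exact cross-entropy form below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax_xent(logits, target):
    # Numerically stable cross-entropy for a single example.
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def per_layer_loss(hidden_states, readouts, target):
    # Per-layer supervision: one independent loss term per layer,
    # summed, instead of a single loss on the final layer only.
    return sum(softmax_xent(h @ W, target)
               for h, W in zip(hidden_states, readouts))

# Hypothetical shapes for the sketch: 3 layers, width 8, vocab 10.
rng = np.random.default_rng(1)
n_layers, d_model, vocab = 3, 8, 10
hidden_states = [rng.normal(size=d_model) for _ in range(n_layers)]
readouts = [rng.normal(size=(d_model, vocab)) for _ in range(n_layers)]

loss = per_layer_loss(hidden_states, readouts, target=3)
```

Because each layer's gradient no longer depends on the layers above it, redundant cross-layer compensation is discouraged, which is consistent with the abstract's claim that this objective unmasks modular circuits.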