Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models

arXiv cs.CL / 4/27/2026

Key Points

  • The paper examines whether transformer language models use shared neural mechanisms across different syntactic constructions, linking this to cross-constructional principles from linguistics.
  • Using fine-grained causal interpretability (activation patching) on filler-gap dependencies and negative polarity item (NPI) licensing, the authors find a localized shared mechanism for filler-gap dependencies in the early-to-middle layers, while NPI processing shows no comparable unified mechanism (a sketch of activation patching follows this list).
  • The identified mechanisms generalize to out-of-distribution data, but a supervised distributed alignment search approach is reported to be vulnerable to overfitting on narrow linguistic distributions.
  • As validation, the authors show that manipulating the attention heads/MLP blocks implicated by activation patching improves performance on acceptability judgment benchmarks.
  • Overall, the study provides causal evidence about which internal components correspond to specific syntactic phenomena and how reliably those interpretations transfer beyond the narrow distributions used to identify them.
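
To make the method concrete, below is a minimal sketch of activation patching on a filler-gap minimal pair. The model choice (GPT-2), the patched component (the layer-5 MLP block), and the sentences are illustrative assumptions, not the paper's exact setup; the paper sweeps individual attention heads and MLP blocks across layers.

```python
# Hypothetical minimal activation-patching sketch (GPT-2, layer-5 MLP).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Token-length-matched minimal pair: "what" licenses a gap after "visited";
# "that" does not, so the verb still expects an object.
clean = "I know what the tourist visited"
corrupt = "I know that the tourist visited"
clean_ids = tokenizer(clean, return_tensors="pt").input_ids
corrupt_ids = tokenizer(corrupt, return_tensors="pt").input_ids

layer = 5  # illustrative choice of component to patch
cache = {}

def save_hook(module, inputs, output):
    # Cache the clean run's MLP output at this layer.
    cache["act"] = output.detach()

def patch_hook(module, inputs, output):
    # Overwrite the corrupted run's MLP output with the cached clean one.
    return cache["act"]

# 1) Clean forward pass: record the activation.
handle = model.transformer.h[layer].mlp.register_forward_hook(save_hook)
with torch.no_grad():
    model(clean_ids)
handle.remove()

# 2) Corrupted forward pass, with and without the patch.
with torch.no_grad():
    corrupt_logits = model(corrupt_ids).logits
handle = model.transformer.h[layer].mlp.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupt_ids).logits
handle.remove()

# 3) Metric: probability of ending the sentence with "." at the gap site.
#    The clean sentence can stop (the gap is licensed by "what"); the
#    corrupted one should prefer an object for "visited".
period = tokenizer(".").input_ids[0]
print("corrupted:", corrupt_logits[0, -1].log_softmax(-1)[period].item())
print("patched:  ", patched_logits[0, -1].log_softmax(-1)[period].item())
```

If the patched run's probability of ending the sentence moves toward the clean run's, the patched component is causally implicated in the dependency; repeating this over every attention head and MLP block produces the per-component localization results the paper reports.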

Abstract

While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that the mechanisms identified by activation patching generalize to out-of-distribution data, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that manipulating the identified components improves model performance on acceptability judgment benchmarks.
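
The acceptability judgment benchmarks are not named in this summary; a common instantiation is BLiMP-style minimal-pair scoring, where a model passes a pair if it assigns higher total log-probability to the grammatical sentence. The sketch below shows that scoring for an NPI pair; the sentences and the use of GPT-2 are illustrative assumptions.

```python
# Hypothetical BLiMP-style minimal-pair scoring with GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The distribution at position t predicts token t+1.
    log_probs = logits[0, :-1].log_softmax(-1)
    targets = ids[0, 1:].unsqueeze(-1)
    return log_probs.gather(-1, targets).sum().item()

# NPI minimal pair: "ever" requires a licensor such as negation.
good = "No student has ever read that book."
bad = "Some student has ever read that book."
print("grammatical:  ", sentence_logprob(good))
print("ungrammatical:", sentence_logprob(bad))
print("pair passed:", sentence_logprob(good) > sentence_logprob(bad))
```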