Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

arXiv cs.CL / April 16, 2026


Key Points

  • The paper uses causal interventions in Transformer language models to study how syntactic “islands” block extraction, showing that models reproduce human acceptability judgments that vary by lexical content.
  • It finds that extraction from coordination islands uses the same filler-gap mechanisms as canonical wh-dependencies, but those mechanisms are selectively inhibited to different degrees.
  • By isolating functionally relevant subspaces across Transformer blocks, attention modules, and MLPs, the authors provide mechanistic evidence linking representational structure to syntactic constraints (a minimal sketch of this kind of subspace intervention follows this list).
  • Projecting a large corpus of unrelated text onto the causally identified subspaces yields a new hypothesis: the conjunction “and” is encoded differently in extractable versus non-extractable constructions, corresponding to relational versus purely conjunctive uses.
  • Overall, the work demonstrates how mechanistic interpretability techniques can generate testable linguistic hypotheses about representation and processing in Transformer LMs.
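To make the method concrete, here is a minimal sketch, not the authors' code, of the kind of intervention the paper describes: an interchange intervention that swaps activations along a low-rank subspace of one Transformer block's residual stream, then measures how the model's gap preference changes. The layer index, the subspace basis `Q` (a random stand-in where the paper would learn one), the probe sentences, and the gap/no-gap probe tokens are all illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6                                  # hypothetical intervention site
d_model = model.config.n_embd
k = 16                                     # hypothetical subspace dimension
Q, _ = torch.linalg.qr(torch.randn(d_model, k))   # random stand-in basis

def get_hidden(text, layer):
    """Residual-stream activations at the output of block `layer`."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer + 1]    # (1, seq_len, d_model)

base = "I know what he looked down and saw"   # acceptable extraction
source = "I know what he hates art and"       # degraded-extraction context
src_h = get_hidden(source, LAYER)

def patch_subspace(module, inputs, output):
    """Replace the span(Q) component of the base run's activations with the
    corresponding component from the source run, leaving the rest intact.
    Positions are aligned naively up to the shorter sequence length."""
    h = output[0]
    n = min(h.shape[1], src_h.shape[1])
    h[:, :n] += (src_h[:, :n] - h[:, :n]) @ Q @ Q.T
    return (h,) + output[1:]

# Clean run, then patched run with the hook active.
ids = tok(base, return_tensors="pt")
with torch.no_grad():
    clean = model(**ids).logits[0, -1]
handle = model.transformer.h[LAYER].register_forward_hook(patch_subspace)
with torch.no_grad():
    patched = model(**ids).logits[0, -1]
handle.remove()

# Gap preference after "saw": log P(".") - log P(" the").
gap, nogap = tok(".")["input_ids"][0], tok(" the")["input_ids"][0]
def gap_pref(logits):
    logp = logits.log_softmax(-1)
    return (logp[gap] - logp[nogap]).item()

print(f"gap preference (clean):   {gap_pref(clean):+.3f}")
print(f"gap preference (patched): {gap_pref(patched):+.3f}")
```

The gap-preference score follows the standard filler-gap probing setup: if the patched subspace carries the dependency, overwriting it from a degraded-extraction context should reduce the model's preference for the gap continuation.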

Abstract

We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., "I know what he hates art and loves" vs. "I know what he looked down and saw"). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction "and" is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.
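The corpus-projection step described in the abstract can be sketched the same way, under the same assumptions: run unrelated sentences through the model, project each “and” token's activation onto the causally identified subspace, and compare coordinates across uses. This reuses `tok`, `model`, `get_hidden`, `Q`, and `LAYER` from the sketch above; the two example sentences are invented, and `Q` is a stand-in for the learned basis.

```python
corpus = [
    "He stood up and left the room.",        # plausibly relational "and"
    "She bought apples and oranges.",        # plausibly pure conjunction
]
and_id = tok(" and")["input_ids"][0]          # GPT-2 token for " and"

for sent in corpus:
    ids = tok(sent, return_tensors="pt")["input_ids"][0]
    h = get_hidden(sent, LAYER)[0]            # (seq_len, d_model)
    for pos in (ids == and_id).nonzero(as_tuple=True)[0]:
        coords = h[pos] @ Q                   # coordinates in span(Q)
        print(f"{sent!r}: |projection of 'and'| = {coords.norm():.3f}")
```

In the paper's framing, systematic differences in these subspace coordinates between relational and purely conjunctive uses of “and” would be the kind of corpus-level signal that motivates the new hypothesis.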