Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

arXiv cs.LG / April 27, 2026


Key Points

  • The paper studies how bracket-sequence (Dyck language) transformers represent hierarchical structure, comparing whether observed signals are merely decodable or actually used causally.
  • Through probing and intervention on the residual stream and attention patterns, the authors find that depth, distance, and top-of-stack signals are decodable but do not all play the same causal role.
  • Masking attention specifically at the true top-of-stack position sharply reduces long-distance accuracy, indicating that certain attention behaviors are causally important.
  • In contrast, ablating low-dimensional residual-stream subspaces produces comparatively little impact, suggesting that not all decodable internal representations are causally necessary.
  • The findings also hold in a templated natural-language setting, reinforcing the general claim that decodability alone does not guarantee causal use of internal variables.
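The two interventions contrasted above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the toy attention scores, and the residual vector are all hypothetical stand-ins. Masking attention to a position means forcing its pre-softmax score to negative infinity so it receives zero attention mass; ablating a low-dimensional subspace means projecting the residual vector onto the orthogonal complement of the probe directions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mask_top_of_stack(scores, tos_index):
    """Remove attention to the (known) top-of-stack position by
    setting its pre-softmax score to -inf, then renormalizing."""
    masked = scores.copy()
    masked[tos_index] = -np.inf
    return softmax(masked)

def ablate_subspace(resid, directions):
    """Project a residual-stream vector onto the orthogonal
    complement of a low-dimensional subspace, whose (not
    necessarily orthonormal) spanning directions are the rows
    of `directions`."""
    Q, _ = np.linalg.qr(directions.T)  # orthonormal basis of the subspace
    return resid - Q @ (Q.T @ resid)

# Toy pre-softmax attention scores over four positions;
# suppose position 2 is the true top-of-stack.
scores = np.array([2.0, 0.5, 3.0, 1.0])
attn = mask_top_of_stack(scores, tos_index=2)

# Toy residual vector with a one-dimensional "depth" subspace ablated.
resid = np.array([1.0, 2.0, 3.0, 4.0])
ablated = ablate_subspace(resid, np.array([[1.0, 0.0, 0.0, 0.0]]))
```

After masking, `attn` places exactly zero probability on position 2; after ablation, `ablated` has no component along the ablated direction. The paper's finding is that the first kind of intervention hurts long-distance bracket matching badly, while the second changes behavior comparatively little.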

Abstract

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
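Operationally, "decodable" in the abstract means a simple probe can read a variable (such as bracket depth) out of the residual stream. The sketch below illustrates that notion on synthetic activations in which depth is linearly embedded along a random direction plus noise; this is an illustrative assumption, not the paper's data, and the variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: bracket depth
# is embedded along one random direction, with additive noise.
n, d = 200, 16
depth = rng.integers(0, 5, size=n)          # ground-truth depth labels
w_true = rng.normal(size=d)                 # hidden "depth direction"
acts = np.outer(depth, w_true) + 0.1 * rng.normal(size=(n, d))

# Linear probe: least-squares regression from activations to depth.
w, *_ = np.linalg.lstsq(acts, depth, rcond=None)
pred = np.rint(acts @ w)
accuracy = (pred == depth).mean()
```

A high probe accuracy here shows the signal is decodable, but says nothing about whether the model uses it; the paper's point is that establishing causal use requires interventions like the attention masking and subspace ablation described above.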