Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers
arXiv cs.LG / 4/28/2026
Key Points
- The paper revisits the finding of Dong et al. (2021) that Transformers built from self-attention alone (no skip connections, no feed-forward/MLP blocks) suffer rapid rank collapse, with token representations converging to a single direction, and argues that this standard account is an incomplete basis for understanding the architecture.
- It shows that layer normalization (LN) is exactly affine-rank-neutral, meaning it preserves the affine rank of the token representation set, so the common claim that LN “plays no role” is inaccurate even if LN doesn’t directly drive collapse.
- The authors demonstrate that residual connections generally prevent rank collapse in real Transformer architectures (e.g., BERT-base) in a measure-theoretic sense, and they characterize the MLP’s unique contribution as creating feature directions outside the linear span of the original token embeddings.
- Beyond rank collapse, the work identifies head-channel non-identifiability: after multi-head outputs are summed and mixed by the output projection, contributions from individual heads cannot be uniquely recovered, leaving a substantial ambiguity in per-layer head attribution.
- The paper proposes a low-overhead constructive partial remedy, position-gated output projection (PG-OP), and unifies multiple reported collapse phenomena under a symmetry-breaking framework tied to distinct symmetries in the Transformer forward pass.
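The rank-collapse claim and the role of residual connections are easy to check numerically. The toy sketch below (not the paper's construction; a single-head attention stack with random weights) tracks the relative distance of the token matrix from its best rank-1 approximation, a quantity in the spirit of the residual measure used by Dong et al. (2021). Without skip connections, the row-stochastic attention matrix repeatedly averages token representations toward their mean, driving the residual toward zero; with skip connections, it stays large.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, layers = 16, 8, 8            # embedding dim, tokens, depth

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rel_residual_rank1(X):
    # relative distance of X from its best rank-1 approximation,
    # computed from the singular values beyond the first
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum()))

def run(use_residual):
    X = rng.standard_normal((n, d))
    res = [rel_residual_rank1(X)]
    for _ in range(layers):
        Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))   # (n, n) attention
        out = A @ X @ Wv                                  # single head, no MLP
        X = X + out if use_residual else out
        res.append(rel_residual_rank1(X))
    return res

no_skip = run(use_residual=False)   # attention only: collapses toward rank 1
skip = run(use_residual=True)       # with skip connections: rank is retained
print(f"no residual: {no_skip[0]:.3f} -> {no_skip[-1]:.3f}")
print(f"residual:    {skip[0]:.3f} -> {skip[-1]:.3f}")
```

The collapse is self-reinforcing: as the representations approach rank 1, the attention logits become nearly equal, attention becomes nearly uniform, and the averaging accelerates.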
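The head-channel non-identifiability point can also be illustrated concretely. The sketch below assumes, for simplicity, that all heads share one uniform attention pattern (the ambiguity does not require this, but it makes the reassignment easy to exhibit): mixing the concatenated value channels with an invertible matrix and undoing the mix inside the output projection leaves the summed layer output algebraically unchanged while changing every per-head attribution.

```python
import numpy as np

rng = np.random.default_rng(1)
d, dh, h, n = 16, 4, 4, 6                 # model dim, head dim, heads, tokens
X = rng.standard_normal((n, d))
A = np.full((n, n), 1.0 / n)              # one uniform attention pattern, shared by all heads

Wv = rng.standard_normal((d, h * dh))     # concatenated per-head value projections
Wo = rng.standard_normal((h * dh, d))     # output projection

def per_head_contributions(Wv, Wo):
    H = A @ X @ Wv                        # concatenated head outputs, shape (n, h*dh)
    return [H[:, i*dh:(i+1)*dh] @ Wo[i*dh:(i+1)*dh] for i in range(h)]

c1 = per_head_contributions(Wv, Wo)
out1 = sum(c1)

# Reparameterize: mix value channels across heads with an invertible M and
# cancel the mix in the output projection. The sum over heads is unchanged,
# but the individual head contributions are not.
M = np.eye(h * dh) + 0.1 * rng.standard_normal((h * dh, h * dh))
c2 = per_head_contributions(Wv @ M, np.linalg.inv(M) @ Wo)
out2 = sum(c2)

same_output = np.allclose(out1, out2)     # True: layer output identical
same_head0 = np.allclose(c1[0], c2[0])    # False: head-0 attribution changed
print(same_output, same_head0)
```

Since only the sum of per-head contributions reaches the rest of the network, any attribution method that assigns credit to individual heads after the output projection faces exactly this ambiguity.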