Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers
arXiv cs.LG / 4/28/2026
Key Points
- The paper revisits the finding of Dong et al. (2021) that Transformers built from self-attention alone (no skip connections, no feed-forward/MLP blocks) suffer rapid rank collapse, with token representations converging to a single direction, and argues that this standard account is an incomplete basis for understanding the architecture.
- It shows that layer normalization (LN) is exactly affine-rank-neutral, meaning it preserves the affine rank of the token representation set, so the common claim that LN “plays no role” is inaccurate even if LN doesn’t directly drive collapse.
- The authors demonstrate that residual connections generally prevent rank collapse in real Transformer architectures (e.g., BERT-base) in a measure-theoretic sense, and they characterize the MLP’s unique contribution as creating feature directions outside the linear span of the original token embeddings.
- Beyond rank collapse, the work identifies head-channel non-identifiability: after multi-head outputs are summed and mixed by the output projection, contributions from individual heads cannot be uniquely recovered, leaving a substantial ambiguity in per-layer head attribution.
- The paper proposes a low-overhead constructive partial remedy, position-gated output projection (PG-OP), and unifies multiple reported collapse phenomena under a symmetry-breaking framework tied to distinct symmetries in the Transformer forward pass.
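The rank-collapse claim and the role of residual connections are easy to check numerically. The toy sketch below (not the paper's construction; a single-head attention stack with random weights) tracks the relative distance of the token matrix from its best rank-1 approximation, a quantity in the spirit of the residual measure used by Dong et al. (2021). Without skip connections, the row-stochastic attention matrix repeatedly averages token representations toward their mean, driving the residual toward zero; with skip connections, it stays large.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, layers = 16, 8, 8            # embedding dim, tokens, depth

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rel_residual_rank1(X):
    # relative distance of X from its best rank-1 approximation,
    # computed from the singular values beyond the first
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum()))

def run(use_residual):
    X = rng.standard_normal((n, d))
    res = [rel_residual_rank1(X)]
    for _ in range(layers):
        Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))   # (n, n) attention
        out = A @ X @ Wv                                  # single head, no MLP
        X = X + out if use_residual else out
        res.append(rel_residual_rank1(X))
    return res

no_skip = run(use_residual=False)   # attention only: collapses toward rank 1
skip = run(use_residual=True)       # with skip connections: rank is retained
print(f"no residual: {no_skip[0]:.3f} -> {no_skip[-1]:.3f}")
print(f"residual:    {skip[0]:.3f} -> {skip[-1]:.3f}")
```

The collapse is self-reinforcing: as the representations approach rank 1, the attention logits become nearly equal, attention becomes nearly uniform, and the averaging accelerates.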
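The head-channel non-identifiability point can also be illustrated concretely. The sketch below assumes, for simplicity, that all heads share one uniform attention pattern (the ambiguity does not require this, but it makes the reassignment easy to exhibit): mixing the concatenated value channels with an invertible matrix and undoing the mix inside the output projection leaves the summed layer output algebraically unchanged while changing every per-head attribution.

```python
import numpy as np

rng = np.random.default_rng(1)
d, dh, h, n = 16, 4, 4, 6                 # model dim, head dim, heads, tokens
X = rng.standard_normal((n, d))
A = np.full((n, n), 1.0 / n)              # one uniform attention pattern, shared by all heads

Wv = rng.standard_normal((d, h * dh))     # concatenated per-head value projections
Wo = rng.standard_normal((h * dh, d))     # output projection

def per_head_contributions(Wv, Wo):
    H = A @ X @ Wv                        # concatenated head outputs, shape (n, h*dh)
    return [H[:, i*dh:(i+1)*dh] @ Wo[i*dh:(i+1)*dh] for i in range(h)]

c1 = per_head_contributions(Wv, Wo)
out1 = sum(c1)

# Reparameterize: mix value channels across heads with an invertible M and
# cancel the mix in the output projection. The sum over heads is unchanged,
# but the individual head contributions are not.
M = np.eye(h * dh) + 0.1 * rng.standard_normal((h * dh, h * dh))
c2 = per_head_contributions(Wv @ M, np.linalg.inv(M) @ Wo)
out2 = sum(c2)

same_output = np.allclose(out1, out2)     # True: layer output identical
same_head0 = np.allclose(c1[0], c2[0])    # False: head-0 attribution changed
print(same_output, same_head0)
```

Since only the sum of per-head contributions reaches the rest of the network, any attribution method that assigns credit to individual heads after the output projection faces exactly this ambiguity.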