Gating Enables Curvature: A Geometric Expressivity Gap in Attention

arXiv cs.LG · April 17, 2026


Key Points

  • The paper analyzes gated attention using the geometry of representations, modeling attention outputs as mean parameters of Gaussian distributions and studying the resulting Fisher–Rao geometry.
  • It proves that ungated (affine) attention is limited to intrinsically flat statistical manifolds, while multiplicative gating can realize non-flat geometries, including positively curved manifolds.
  • The authors formalize a “geometric expressivity gap” showing that gated attention has strictly greater representational geometric capability than ungated attention.
  • Empirical results link this geometry to behavior: gated models show higher representation curvature and better performance on tasks needing nonlinear decision boundaries, with no consistent gains for linear-boundary tasks.
  • The study also finds a structured regime where curvature increases under repeated composition, producing a systematic depth amplification effect.
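The geometric claim in the points above can be sketched in one line of information geometry. This is our reading of the setup, not the paper's exact notation: attention outputs are treated as means \(\mu\) of Gaussians \(\mathcal{N}(\mu, \sigma^2 I)\), and curvature refers to the submanifold traced out by the input-to-mean map.

```latex
% Fisher–Rao metric on the mean parameters of N(mu, sigma^2 I):
g_{ij}(\mu) \;=\; \mathbb{E}\!\left[\partial_i \log p \,\partial_j \log p\right]
\;=\; \frac{\delta_{ij}}{\sigma^2}.
% The metric is constant (Euclidean), so the ambient statistical
% manifold is flat. An affine map \mu(x) = Wx + b therefore traces out
% an affine — hence intrinsically flat — submanifold, whereas a
% multiplicative gate \mu(x) = g(x) \odot (Wx + b) is non-affine and
% can trace out a curved one.
```

This is why the flatness restriction is tied specifically to the affine structure of ungated attention rather than to the Gaussian model itself.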

Abstract

Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite the success of gated attention, the mathematical implications of gated attention mechanisms remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher–Rao geometry. We show that the ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries, whereas they provide no consistent advantage on tasks with linear decision boundaries. Furthermore, we identify a structured regime in which curvature accumulates under composition, yielding a systematic depth amplification effect.
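The contrast between the two operators can be made concrete with a minimal NumPy sketch. This is a generic formulation, not the paper's implementation: standard scaled dot-product attention produces a convex (hence affine) combination of the value vectors, while a sigmoid gate multiplies the output elementwise; the gate logits `G` here are a hypothetical stand-in for whatever learned gating function a real model would use.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each output row is a convex
    # combination of the rows of V, so the map is affine in V.
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    return A @ V

def gated_attention(Q, K, V, G):
    # Multiplicative gating: an elementwise sigmoid gate applied to
    # the attention output. G is a hypothetical gate-logit matrix;
    # the product is no longer affine in the inputs.
    gate = 1.0 / (1.0 + np.exp(-G))
    return gate * attention(Q, K, V)

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V, G = (rng.standard_normal((n, d)) for _ in range(4))

out = attention(Q, K, V)
gated = gated_attention(Q, K, V, G)
print(out.shape, gated.shape)  # → (4, 8) (4, 8)
```

Note that every coordinate of the ungated output lies inside the range spanned by the corresponding value coordinates (a consequence of the convex combination), while the gate rescales each coordinate independently.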