The Geometric Cost of Normalization: Affine Bounds on the Bayesian Complexity of Neural Networks

arXiv cs.LG / 3/31/2026


Key Points

  • The paper analyzes how LayerNorm and RMSNorm impose different geometric constraints (LayerNorm’s mean-centering confines outputs to a linear hyperplane; RMSNorm projects them onto a sphere) and shows these constraints have an exact effect on Bayesian complexity as measured by the Local Learning Coefficient (LLC).
  • It proves that LayerNorm reduces the LLC of a subsequent weight matrix by exactly m/2 (m = output dimension), while RMSNorm preserves LLC, implying a training-independent complexity change determined by data-manifold geometry.
  • The authors identify a geometric threshold for codimension-one manifolds: any non-zero curvature preserves LLC (no drop), whereas only affinely flat manifolds trigger the guaranteed m/2 reduction.
  • At finite sample sizes, the paper shows this threshold becomes a smooth crossover whose width depends on the fraction of the data distribution experiencing curvature rather than simply whether curvature exists.
  • Experiments using the wrLLC framework validate the theoretical predictions, and the study extends the results to show that Softmax-simplex inputs can induce an effective “m/2 LLC drop” via a smuggled bias when combined with an explicit downstream bias.
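The geometric contrast in the first key point can be checked numerically. The sketch below (an illustration written for this summary, not code from the paper) implements the two normalizations without learnable scale/shift and verifies that LayerNorm outputs lie on the zero-mean hyperplane through the origin while RMSNorm outputs lie on a sphere of constant RMS norm:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Mean-center, then scale to unit variance (no learnable gamma/beta).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # Scale to unit RMS norm; no mean-centering.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

ln = layer_norm(x)
rn = rms_norm(x)

# LayerNorm: every output vector sums to 0, i.e. it lies on a linear
# hyperplane through the origin (codimension one).
print(np.allclose(ln.sum(axis=-1), 0.0, atol=1e-5))          # True
# RMSNorm: every output has unit mean square, i.e. it lies on a sphere
# of radius sqrt(d); its coordinate mean is generally non-zero.
print(np.allclose((rn ** 2).mean(axis=-1), 1.0, atol=1e-5))  # True
```

The hyperplane is affinely flat, while the sphere is curved everywhere, which is exactly the dichotomy the paper's codimension-one threshold result addresses.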

Abstract

LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs, and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm's mean-centering step, by confining data to a linear hyperplane through the origin, reduces the Local Learning Coefficient (LLC) of the subsequent weight matrix by exactly m/2 (where m is its output dimension); RMSNorm's projection onto a sphere preserves the LLC entirely. This reduction is structurally guaranteed before any training begins, determined by data manifold geometry alone. The underlying condition is a geometric threshold: for the codimension-one manifolds we study, the LLC drop is binary. Any non-zero curvature, regardless of sign or magnitude, is sufficient to preserve the LLC, while only affinely flat manifolds cause the drop. At finite sample sizes this threshold acquires a smooth crossover whose width depends on how much of the data distribution actually experiences the curvature, not merely on whether curvature exists somewhere. We verify both predictions experimentally with controlled single-layer scaling experiments using the wrLLC framework. We further show that Softmax simplex data introduces a "smuggled bias" that activates the same m/2 LLC drop when paired with an explicit downstream bias, proved via the affine symmetry extension of the main theorem and confirmed empirically.
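The symmetry behind the m/2 drop can also be illustrated directly. When inputs are confined to the zero-mean hyperplane, adding any multiple of the all-ones row direction to each row of the downstream weight matrix leaves the layer's outputs unchanged, so m weight directions become unidentifiable; under the usual half-count of degenerate directions, this is the source of the m/2 reduction. The sketch below is an illustration of that symmetry argument written for this summary (it is not the paper's proof or code):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 8

W = rng.normal(size=(m, n))   # downstream weight matrix, m outputs
c = rng.normal(size=(m, 1))   # arbitrary shift along the all-ones direction

# A mean-centered input, as LayerNorm would produce: coordinates sum to zero.
x = rng.normal(size=(n,))
x = x - x.mean()

# Shift every row of W by a multiple of the all-ones vector. On the
# hyperplane 1^T x = 0 this changes nothing: (W + c 1^T) x = W x + c (1^T x).
W2 = W + c @ np.ones((1, n))

print(np.allclose(W @ x, W2 @ x))  # True: m directions are unidentifiable
```

Under RMSNorm, inputs keep a generically non-zero mean, so the same shift does change the outputs and no such flat symmetry direction exists, which is consistent with the abstract's claim that RMSNorm preserves the LLC.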