
[R] A Gradient Descent Misalignment — Causes Normalisation To Emerge

Reddit r/MachineLearning / 3/18/2026


Key Points

  • The paper asks whether gradient descent systematically takes the wrong step in activation space, showing that parameters move along the steepest descent while activations do not.
  • It proves this misalignment for simple affine layers, convolution, and attention, and proposes solutions including a new affine-like layer with built-in normalization that preserves degrees of freedom and a new PatchNorm normalizer for convolution.
  • Empirically, the affine-like solution is not scale-invariant and not a normaliser, yet it matches or exceeds BatchNorm/LayerNorm in controlled MLP ablations, suggesting scale invariance is not the primary mechanism and that misalignment may be key; the framework also predicts that increasing batch size should hurt divergence-correcting layers, an effect observed experimentally.
  • These results offer a potential mechanistic explanation for why normalization helps and point to new directions in layer design and normalization methods for neural networks.

This paper, just accepted at ICLR's GRaM workshop, asks a simple question:

Does gradient descent systematically take the wrong step in activation space?

The paper shows:

Parameters take the step of steepest descent; activations do not

The paper mathematically demonstrates this for simple affine layers, convolution, and attention.
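
Not the derivation from the paper, just a toy numpy illustration of the flavour of the issue for a plain affine layer y = Wx + b with a squared-error loss: the parameter step changes each sample's activations by roughly -lr·(XXᵀ + 11ᵀ)G rather than -lr·G, so once there is a minibatch the Gram matrix mixes samples together and the induced activation step is no longer the activation-space steepest-descent direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_step_alignment(batch_size, d_in=32, d_out=16, lr=1e-2):
    # Plain affine layer y = W x + b on a random minibatch.
    X = rng.normal(size=(batch_size, d_in))
    W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
    b = np.zeros(d_out)
    Y = X @ W.T + b

    # Toy squared-error loss L = 0.5 * ||Y - T||^2, so dL/dY = Y - T.
    T = rng.normal(size=Y.shape)
    G = Y - T                                # per-sample activation gradients g_i

    # Steepest-descent step on the parameters.
    W_step = -lr * G.T @ X                   # dL/dW = sum_i g_i x_i^T
    b_step = -lr * G.sum(axis=0)             # dL/db = sum_i g_i

    # First-order change this induces in the activations (inputs held fixed):
    # dY = -lr * (X X^T + 1 1^T) G, i.e. every sample's step mixes in every other sample.
    dY_induced = X @ W_step.T + b_step

    # Steepest descent *in activation space* would simply move along -G.
    dY_steepest = -G

    cos = (dY_induced * dY_steepest).sum(axis=1) / (
        np.linalg.norm(dY_induced, axis=1) * np.linalg.norm(dY_steepest, axis=1))
    return float(cos.mean())

print("batch size  1:", activation_step_alignment(1))    # ~1.0: the two directions coincide
print("batch size 64:", activation_step_alignment(64))   # < 1.0: cross-sample mixing misaligns them
```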

The work then explores solutions to address this.

These solutions may in turn provide an alternative mechanistic explanation for why normalisation helps at all, since two structurally distinct fixes arise: existing (L2/RMS) normalisers and a new form of fully connected (MLP) layer.
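
For reference, here is the first of those two families in its plainest form - minimal L2/RMS normalisers (learned gains omitted), plus a quick check of the scale-invariance property that the ablations below end up questioning as the key mechanism:

```python
import numpy as np

def l2_normalise(x, eps=1e-6):
    """Project each vector onto the unit sphere (its length degree of freedom is removed)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def rms_normalise(x, eps=1e-6):
    """RMSNorm-style rescaling to unit root-mean-square (learned gains omitted)."""
    return x / (np.sqrt(np.mean(x * x, axis=-1, keepdims=True)) + eps)

x = np.random.default_rng(0).normal(size=(4, 64))
for f in (l2_normalise, rms_normalise):
    # Both are scale-invariant: rescaling the input leaves the output (essentially) unchanged.
    print(f.__name__, np.allclose(f(x), f(10.0 * x)))
```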

The paper derives:

  1. A new form of affine-like layer (a.k.a. a new form of fully connected/linear layer), featuring inbuilt normalisation whilst preserving degrees of freedom (unlike typical normalisers). In effect, an alternative layer architecture for MLPs.
  2. A new family of normalisers: "PatchNorm" for convolution, opening new directions for empirical search.

Empirical results include:

  • This affine-like solution is not scale-invariant and is not a normaliser, yet it consistently matches or exceeds BatchNorm/LayerNorm in controlled MLP ablation experiments, suggesting that scale invariance is not the primary mechanism at work and that the misalignment itself may be what matters.
  • The framework makes a clean, falsifiable prediction: increasing batch size should hurt performance for divergence-correcting layers. This counterintuitive effect is observed empirically, and it does not hold for BatchNorm or standard affine layers, corroborating the theory (a toy illustration of the batch-size dependence follows below).
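
This is not the experiment itself (that involves the divergence-correcting layers), but in the same toy affine setup as the sketch above, the misalignment grows with batch size, which gives some feel for why batch size should enter such a prediction at all:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_alignment(batch_size, d_in=32, d_out=16):
    """Cosine between the induced activation step and -dL/dY for y = W x + b,
    same toy squared-error setup as the sketch further up."""
    X = rng.normal(size=(batch_size, d_in))
    W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
    G = X @ W.T - rng.normal(size=(batch_size, d_out))   # dL/dY = Y - target
    dY = -(X @ (G.T @ X).T + G.sum(axis=0))              # induced step (learning rate dropped: it doesn't affect the angle)
    cos = (dY * -G).sum(axis=1) / (np.linalg.norm(dY, axis=1) * np.linalg.norm(G, axis=1))
    return float(cos.mean())

for B in (1, 4, 16, 64, 256):
    print(f"batch size {B:3d}: mean alignment {mean_alignment(B):.3f}")
```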

Hope this is interesting and worth a read.

  • I've added some (hopefully) interesting intuitions scattered throughout, e.g. the consequences of reweighting LayerNorm's mean, why RMSNorm may need the sqrt-n factor, and a unification of normalisers and activation functions. Hopefully these are fresh insights - please let me know what you think. (A quick numerical note on the sqrt-n point follows below.)
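
On the sqrt-n point, the full argument is in the paper; the basic identity it rests on is just that RMS(x) = ‖x‖/√n, so RMS normalisation is L2 normalisation scaled up by √n, which keeps individual entries O(1) as the width grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (16, 256, 4096):
    x = rng.normal(size=n)
    l2_unit = x / np.linalg.norm(x)          # L2 normalisation: entries shrink like 1/sqrt(n)
    rms_unit = x / np.sqrt(np.mean(x * x))   # RMS normalisation: same direction, sqrt(n) larger
    print(n,
          np.allclose(rms_unit, np.sqrt(n) * l2_unit),   # RMS-norm == sqrt(n) * L2-norm
          round(float(rms_unit.std()), 3))                # entry scale stays O(1) as n grows
```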

Happy to answer any questions :-)

[ResearchGate Alternative Link] [Peer Reviews]

submitted by /u/GeorgeBird1