A Mechanism Study of Delayed Loss Spikes in Batch-Normalized Linear Models

arXiv stat.ML · April 21, 2026


Key Points

  • The paper studies a stylized hypothesis for delayed loss spikes in neural-network training: batch normalization can postpone instability by gradually increasing the effective learning rate during an otherwise stable descent (this mechanism is formalized in the sketch after this list).
  • It provides a theorem-level analysis for batch-normalized linear models, with the main results focused on whitened square-loss linear regression.
  • For the whitened square-loss case, the authors derive explicit conditions for when a loss “rising edge” does not occur and when instability onset is delayed, including bounds on the waiting time to directional onset.
  • They show that, within the whitened regime, the rising edge self-stabilizes after finitely many iterations and use a square-loss decomposition to obtain a concrete delayed-spike mechanism.
  • For logistic regression, results are more limited and depend on very restrictive active-margin assumptions, yielding only a finite-horizon directional precursor in a knife-edge regime, with additional appendix-only bounds under extra conditions.
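
The effective-learning-rate mechanism in the first bullet can be made precise via the standard scale-invariance argument from the batch-normalization literature. The sketch below uses that argument and the classical quadratic stability threshold; it is a plausible formalization, not necessarily the paper's exact definitions.

```latex
% Batch-normalizing the output z = w^\top x makes the loss scale-invariant
% in w (standard argument; the paper's precise normalization may differ):
L(\alpha w) = L(w) \quad \forall \alpha > 0
\;\Longrightarrow\;
\langle \nabla L(w), w \rangle = 0,
\qquad
\nabla L(\alpha w) = \tfrac{1}{\alpha}\,\nabla L(w).

% A gradient step w_{t+1} = w_t - \eta \nabla L(w_t) therefore moves the
% direction u_t = w_t / \|w_t\| with effective step size
\eta_{\mathrm{eff},t} = \frac{\eta}{\|w_t\|^{2}},
\qquad
\|w_{t+1}\|^{2} = \|w_t\|^{2} + \eta^{2}\,\|\nabla L(w_t)\|^{2}.

% Whether \eta_{\mathrm{eff},t} drifts up or down depends on the norm
% dynamics (e.g., weight decay shrinks \|w_t\| and raises it). Gradient
% descent on a quadratic with top curvature \lambda_{\max} is stable iff
% the step size stays below 2/\lambda_{\max}, so an upward-drifting
% \eta_{\mathrm{eff},t} can cross that threshold only after many stable
% iterations -- the delayed-onset shape of the hypothesis.
```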

Abstract

Delayed loss spikes have been reported in neural-network training, but existing theory mainly explains earlier non-monotone behavior caused by overly large fixed learning rates. We study one stylized hypothesis: normalization can postpone instability by gradually increasing the effective learning rate during an otherwise stable descent. To test this hypothesis at the theorem level, we analyze batch-normalized linear models. Our flagship result concerns whitened square-loss linear regression, where we derive explicit no-rising-edge and delayed-onset conditions, bound the waiting time to directional onset, and show that the rising edge self-stabilizes within finitely many iterations. Combined with a square-loss decomposition, this yields a concrete delayed-spike mechanism in the whitened regime. For logistic regression, under highly restrictive active-margin assumptions, we prove only a supporting finite-horizon directional precursor in a knife-edge regime, with an optional appendix-only loss lower bound under an extra non-degeneracy condition. The paper should therefore be read as a stylized mechanism study rather than a general explanation of neural-network loss spikes. Within that scope, the results isolate one concrete delayed-instability pathway induced by batch normalization.
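
To make the quantities in this mechanism concrete, here is a minimal NumPy harness, an illustrative sketch rather than the paper's experiment: whitened inputs, a population-form batch-normalized linear predictor X w / ||w||, square loss, and gradient descent with decoupled weight decay. The weight decay (and all numeric values) are assumptions added so that ||w_t|| can shrink and the effective step size η/||w_t||² can drift upward.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 32
X = rng.standard_normal((n, d))            # approximately whitened inputs: E[xx^T] ~ I
w_star = rng.standard_normal(d)
y = X @ (w_star / np.linalg.norm(w_star))  # targets from a unit-norm linear teacher
w = rng.standard_normal(d)
eta, wd = 0.1, 5e-3                        # step size / weight decay: assumed values

def loss_and_grad(w):
    """Square loss of the normalized predictor z = X w / ||w|| and its gradient."""
    nw = np.linalg.norm(w)
    z = X @ w / nw
    r = z - y
    loss = 0.5 * np.mean(r ** 2)
    # Jacobian dz/dw = X/nw - (X w) w^T / nw^3, contracted with r/n:
    g = (X.T @ r) / (n * nw) - w * ((X @ w) @ r) / (n * nw ** 3)
    return loss, g

for t in range(2001):
    loss, g = loss_and_grad(w)
    if t % 250 == 0:
        print(f"t={t:4d}  loss={loss:.4f}  |w|={np.linalg.norm(w):.3f}  "
              f"eta_eff={eta / np.linalg.norm(w) ** 2:.4f}")
    w = (1.0 - eta * wd) * w - eta * g      # decoupled weight decay + GD step
```

Logging loss, ||w_t||, and η_eff side by side lets one watch whether the effective step size keeps drifting long after the loss has plateaued; the paper's theorems characterize when such a drift does or does not trigger a delayed rising edge in the whitened regime.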