A Mechanism Study of Delayed Loss Spikes in Batch-Normalized Linear Models

arXiv stat.ML · April 21, 2026


Key Points

  • The paper studies a stylized hypothesis for delayed loss spikes in neural-network training: batch normalization can postpone instability by gradually increasing the effective learning rate during an otherwise stable descent (this mechanism is formalized in the sketch after this list).
  • It provides a theorem-level analysis for batch-normalized linear models, with the main results focused on whitened square-loss linear regression.
  • For the whitened square-loss case, the authors derive explicit conditions for when a loss “rising edge” does not occur and when instability onset is delayed, including bounds on the waiting time to directional onset.
  • They show that, within the whitened regime, the rising edge self-stabilizes after finitely many iterations and use a square-loss decomposition to obtain a concrete delayed-spike mechanism.
  • For logistic regression, results are more limited and depend on very restrictive active-margin assumptions, yielding only a finite-horizon directional precursor in a knife-edge regime, with additional appendix-only bounds under extra conditions.
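
The effective-learning-rate mechanism in the first bullet can be made precise via the standard scale-invariance argument from the batch-normalization literature. The sketch below uses that argument and the classical quadratic stability threshold; it is a plausible formalization, not necessarily the paper's exact definitions.

```latex
% Batch-normalizing the output z = w^\top x makes the loss scale-invariant
% in w (standard argument; the paper's precise normalization may differ):
L(\alpha w) = L(w) \quad \forall \alpha > 0
\;\Longrightarrow\;
\langle \nabla L(w), w \rangle = 0,
\qquad
\nabla L(\alpha w) = \tfrac{1}{\alpha}\,\nabla L(w).

% A gradient step w_{t+1} = w_t - \eta \nabla L(w_t) therefore moves the
% direction u_t = w_t / \|w_t\| with effective step size
\eta_{\mathrm{eff},t} = \frac{\eta}{\|w_t\|^{2}},
\qquad
\|w_{t+1}\|^{2} = \|w_t\|^{2} + \eta^{2}\,\|\nabla L(w_t)\|^{2}.

% Whether \eta_{\mathrm{eff},t} drifts up or down depends on the norm
% dynamics (e.g., weight decay shrinks \|w_t\| and raises it). Gradient
% descent on a quadratic with top curvature \lambda_{\max} is stable iff
% the step size stays below 2/\lambda_{\max}, so an upward-drifting
% \eta_{\mathrm{eff},t} can cross that threshold only after many stable
% iterations -- the delayed-onset shape of the hypothesis.
```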

Abstract

Delayed loss spikes have been reported in neural-network training, but existing theory mainly explains earlier non-monotone behavior caused by overly large fixed learning rates. We study one stylized hypothesis: normalization can postpone instability by gradually increasing the effective learning rate during an otherwise stable descent. To test this hypothesis at the theorem level, we analyze batch-normalized linear models. Our flagship result concerns whitened square-loss linear regression, where we derive explicit no-rising-edge and delayed-onset conditions, bound the waiting time to directional onset, and show that the rising edge self-stabilizes within finitely many iterations. Combined with a square-loss decomposition, this yields a concrete delayed-spike mechanism in the whitened regime. For logistic regression, under highly restrictive active-margin assumptions, we prove only a supporting finite-horizon directional precursor in a knife-edge regime, with an optional appendix-only loss lower bound under an extra non-degeneracy condition. The paper should therefore be read as a stylized mechanism study rather than a general explanation of neural-network loss spikes. Within that scope, the results isolate one concrete delayed-instability pathway induced by batch normalization.
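
To make the quantities in this mechanism concrete, here is a minimal NumPy harness, an illustrative sketch rather than the paper's experiment: whitened inputs, a population-form batch-normalized linear predictor X w / ||w||, square loss, and gradient descent with decoupled weight decay. The weight decay (and all numeric values) are assumptions added so that ||w_t|| can shrink and the effective step size η/||w_t||² can drift upward.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 32
X = rng.standard_normal((n, d))            # approximately whitened inputs: E[xx^T] ~ I
w_star = rng.standard_normal(d)
y = X @ (w_star / np.linalg.norm(w_star))  # targets from a unit-norm linear teacher
w = rng.standard_normal(d)
eta, wd = 0.1, 5e-3                        # step size / weight decay: assumed values

def loss_and_grad(w):
    """Square loss of the normalized predictor z = X w / ||w|| and its gradient."""
    nw = np.linalg.norm(w)
    z = X @ w / nw
    r = z - y
    loss = 0.5 * np.mean(r ** 2)
    # Jacobian dz/dw = X/nw - (X w) w^T / nw^3, contracted with r/n:
    g = (X.T @ r) / (n * nw) - w * ((X @ w) @ r) / (n * nw ** 3)
    return loss, g

for t in range(2001):
    loss, g = loss_and_grad(w)
    if t % 250 == 0:
        print(f"t={t:4d}  loss={loss:.4f}  |w|={np.linalg.norm(w):.3f}  "
              f"eta_eff={eta / np.linalg.norm(w) ** 2:.4f}")
    w = (1.0 - eta * wd) * w - eta * g      # decoupled weight decay + GD step
```

Logging loss, ||w_t||, and η_eff side by side lets one watch whether the effective step size keeps drifting long after the loss has plateaued; the paper's theorems characterize when such a drift does or does not trigger a delayed rising edge in the whitened regime.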