How iteration order influences convergence and stability in deep learning

arXiv stat.ML · March 30, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates training stability and convergence for neural networks under constant learning rates and small batch sizes, aiming to explain optimization instabilities beyond learning-rate scheduling.
  • It argues that the order in which gradient updates are composed can materially change stability and convergence behavior in gradient-based optimizers.
  • Using backward-SGD, which reverses the usual forward composition order of per-batch gradient updates, the authors show that in contractive regions near minima backward-SGD converges to a point, whereas standard forward-SGD typically converges only to a distribution.
  • Although full backward-SGD is computationally expensive, the work presents it as a proof of concept that creatively reusing prior batches and altering iteration composition may improve training stability.
  • The authors frame their results as a novel, largely unexplored avenue in deep learning optimization, backed by theoretical analysis and experiments.
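The forward-vs-backward distinction can be illustrated with a toy model. The sketch below stands in each per-batch SGD update with a random contractive affine map — a simplification assumed here for illustration, not the paper's actual gradient updates. Forward iteration applies the newest map last (outermost); backward iteration applies it first (innermost), so its influence on the iterate shrinks with the accumulated contraction factor and the iterates settle to a point, while forward iterates keep fluctuating around a stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random contractive affine maps phi_k(x) = a_k * x + b_k with |a_k| < 1,
# a stand-in (assumption) for per-batch update maps near a minimum.
n_steps = 200
a = rng.uniform(0.5, 0.9, size=n_steps)
b = rng.normal(0.0, 1.0, size=n_steps)

def apply_maps(order, x0=0.0):
    """Apply the maps indexed by `order`, in sequence, starting from x0."""
    x = x0
    for k in order:
        x = a[k] * x + b[k]
    return x

# Forward iterate at step n: phi_{n-1} ∘ ... ∘ phi_0 (newest map outermost).
forward = [apply_maps(range(n)) for n in range(1, n_steps + 1)]
# Backward iterate at step n: phi_0 ∘ ... ∘ phi_{n-1} (newest map innermost).
backward = [apply_maps(reversed(range(n))) for n in range(1, n_steps + 1)]

print("last forward iterates :", np.round(forward[-5:], 4))
print("last backward iterates:", np.round(backward[-5:], 4))
```

In the contractive regime the backward iterates converge geometrically fast to a single point, matching the paper's claim, while the forward iterates never stop moving. The sketch also makes the cost visible: each backward step recomposes all previous maps from scratch, which is why full backward-SGD is expensive in practice.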

Abstract

Despite exceptional achievements, training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the composition order of gradient updates affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which produces parameter iterates at each step by reverting the usual forward composition order of batch gradients. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point, while standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence, which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights that the extra freedom gained by modifying the usual iteration composition and creatively reusing previous batches at each optimization step may have important beneficial effects on training. Our experiments provide a proof of concept supporting this phenomenon. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.