On the Theory of Continual Learning with Gradient Descent for Neural Networks

arXiv stat.ML / 4/21/2026


Key Points

  • The paper studies continual learning—adapting to a stream of tasks without forgetting earlier ones—in a tractable setting: one-hidden-layer quadratic neural networks trained by gradient descent.
  • It analyzes training on a sequence of XOR-cluster datasets with Gaussian noise, where each task corresponds to clusters with orthogonal means, enabling explicit characterization of gradient descent dynamics.
  • The authors derive tight, closed-form bounds on train-time forgetting rates as functions of optimization iterations, sample size, number of tasks, and hidden-layer width.
  • By applying an algorithmic stability framework, the work also bounds the generalization gap and translates this into guarantees on forgetting at test time.
  • Numerical experiments confirm the theoretical predictions and demonstrate how the problem parameters collectively determine continual-learning forgetting behavior.
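To make the setup concrete, here is a minimal NumPy sketch of the kind of experiment the key points describe. Several details are assumptions for illustration, not the paper's exact construction: task means are taken as standard basis vectors (hence orthogonal), the second layer is fixed at random signs, the loss is squared error, and full-batch gradient descent updates the first layer only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (not the paper's regimes).
d, m = 16, 64          # input dimension, hidden-layer width
T, n = 3, 200          # number of tasks, samples per task
sigma = 0.05           # Gaussian noise level
lr, steps = 0.01, 400  # gradient-descent step size and iterations

def make_task(t):
    """XOR-cluster task t: clusters at +-mu_t labelled +1, at +-nu_t labelled -1.
    Task means are mutually orthogonal (standard basis vectors, a simplification)."""
    mu, nu = np.eye(d)[2 * t], np.eye(d)[2 * t + 1]
    centers = np.stack([mu, -mu, nu, -nu])
    labels = np.array([1.0, 1.0, -1.0, -1.0])
    idx = rng.integers(0, 4, size=n)
    X = centers[idx] + sigma * rng.normal(size=(n, d))
    return X, labels[idx]

# One-hidden-layer quadratic network f(x) = sum_j a_j (w_j . x)^2,
# with the second layer `a` fixed at random signs (an assumed simplification).
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def sq_loss(X, y, W):
    return 0.5 * np.mean((((X @ W.T) ** 2) @ a - y) ** 2)

def train_gd(X, y, W):
    """Full-batch gradient descent on the squared loss, first layer only."""
    for _ in range(steps):
        z = X @ W.T                                 # (n, m) pre-activations
        err = (z ** 2) @ a - y                      # (n,) residuals
        grad = (2.0 * (err[:, None] * z * a)).T @ X / len(y)
        W = W - lr * grad
    return W

tasks = [make_task(t) for t in range(T)]
loss0_before = sq_loss(*tasks[0], W)
hist = []
for X, y in tasks:                                  # train on tasks sequentially
    W = train_gd(X, y, W)
    hist.append(sq_loss(*tasks[0], W))              # task-0 loss after each task

forgetting = hist[-1] - hist[0]                     # train-time forgetting, task 0
print(f"task-0 loss: init={loss0_before:.3f}, after task 0={hist[0]:.3f}, "
      f"after task {T-1}={hist[-1]:.3f}, forgetting={forgetting:.3f}")
```

Tracking the task-0 training loss after each subsequent task gives a direct empirical measure of train-time forgetting; sweeping the iteration count, sample size, number of tasks, or width in this sketch is the kind of experiment the paper's closed-form bounds are meant to predict.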

Abstract

Continual learning, the ability of a model to adapt to an ongoing sequence of tasks without forgetting earlier ones, is a central goal of artificial intelligence. To better understand its underlying mechanisms, we study the limitations of continual learning in a tractable yet representative setting. Specifically, we analyze one-hidden-layer quadratic neural networks trained by gradient descent on a sequence of XOR-cluster datasets with Gaussian noise, where different tasks correspond to clusters with orthogonal means. Our analysis is based on a tight characterization of gradient descent dynamics for the training loss, which yields explicit bounds on the rate of train-time forgetting as functions of the number of iterations, sample size, number of tasks, and hidden-layer width. We then leverage an algorithmic stability framework to bound the generalization gap, leading to corresponding guarantees on test-time forgetting. Together, our results provide the first closed-form guarantees for forgetting in continual learning with neural networks and show how key problem parameters jointly govern forgetting dynamics. Numerical experiments corroborate our theoretical results.
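The two forgetting notions in the abstract can be made precise with standard definitions (these are common conventions in the continual-learning literature, not necessarily the paper's exact metrics). Writing $\widehat{L}_i$ and $L_i$ for the empirical and population losses of task $i$, and $\theta_t$ for the parameters after training through task $t$:

```latex
% Train-time forgetting on task i after all T tasks:
F_i^{\mathrm{train}} = \widehat{L}_i(\theta_T) - \widehat{L}_i(\theta_i).

% If algorithmic stability bounds the generalization gap uniformly,
% |L_i(\theta) - \widehat{L}_i(\theta)| \le \varepsilon_{\mathrm{gen}},
% then test-time forgetting decomposes as
F_i^{\mathrm{test}} = L_i(\theta_T) - L_i(\theta_i)
  \le F_i^{\mathrm{train}} + 2\,\varepsilon_{\mathrm{gen}}.
```

This decomposition illustrates the abstract's route from train-time bounds to test-time guarantees: control forgetting on the training data, bound the generalization gap via stability, and add the two contributions.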