High dimensional theory of two-phase optimizers

arXiv cs.LG / 3/31/2026


Key Points

  • The paper studies LA-DiLoCo, a member of the DiLoCo family, analyzing how a one-worker “LA” variant and its multi-worker two-phase form behave on a high-dimensional linear regression task.
  • It finds that the single-worker algorithm offers a different signal-versus-noise tradeoff compared with SGD, which can be advantageous in many settings.
  • The multi-worker version produces more noise than the single-worker variant, but the paper shows that proper hyperparameter choices can mitigate this extra noise.
  • It extends the analysis to SLA (LA with momentum) and argues that combining two momentum operators can accelerate convergence by effectively reshaping the Hessian spectrum, with Nesterov momentum performing best.
  • Overall, the work positions partially asynchronous two-phase optimizers as a promising new paradigm for understanding and improving optimization in larger training setups.
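To make the two-phase structure concrete, here is a minimal numpy sketch of a DiLoCo-style round on linear regression: each worker runs a few local SGD steps from the shared weights, and the server then applies the averaged worker delta scaled by an outer learning rate. This is an illustrative sketch only; the function names and hyperparameter values (`workers`, `local_steps`, `inner_lr`, `outer_lr`) are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression: y = X w* + noise.
d, n = 50, 2000
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

def sgd_step(w, lr, batch=32):
    """One minibatch SGD step on the squared loss."""
    idx = rng.integers(0, n, size=batch)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    return w - lr * grad

def two_phase_round(w, workers=4, local_steps=20, inner_lr=0.01, outer_lr=0.7):
    """One outer round: each worker optimizes locally from the shared
    point, then the server applies the averaged update ("delta"),
    scaled by an outer learning rate."""
    deltas = []
    for _ in range(workers):
        w_local = w.copy()
        for _ in range(local_steps):
            w_local = sgd_step(w_local, inner_lr)
        deltas.append(w_local - w)
    return w + outer_lr * np.mean(deltas, axis=0)

w = np.zeros(d)
for _ in range(50):
    w = two_phase_round(w)
err = float(np.linalg.norm(w - w_star))
print(err)  # distance to the target weights
```

Averaging the worker deltas is what reduces (but, per the paper, does not eliminate) the extra noise of the multi-worker version; the outer learning rate is one of the hyperparameters the paper identifies as a mitigation knob.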

Abstract

The trend towards larger training setups has brought a renewed interest in partially asynchronous two-phase optimizers, which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies, we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. We show that the one-worker variant, LA, provides a different tradeoff between signal and noise than SGD, which is beneficial in many scenarios. We also show that the multi-worker version generates more noise than the single-worker version, but that this additional noise generation can be ameliorated by an appropriate choice of hyperparameters. We conclude with an analysis of SLA -- LA with momentum -- and show that stacking two momentum operators gives an opportunity for acceleration via a non-linear transformation of the "effective" Hessian spectrum, which is maximized for Nesterov momentum. Altogether our results show that two-phase optimizers represent a fruitful new paradigm for understanding and improving training algorithms.
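The single-worker variant LA is the degenerate case of the loop above: one worker, so the outer step becomes a Lookahead-style interpolation between slow and fast weights. A minimal sketch on a noisy quadratic (standing in for the paper's linear-regression setting) is below; the spectrum `h`, the noise model, and all step sizes are illustrative assumptions. SLA would additionally place momentum on the inner steps; this sketch shows only the plain LA loop.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy quadratic loss 0.5 * sum(h_i * (w_i - w*_i)^2)
# with a spread of curvatures h_i (an assumed Hessian spectrum).
d = 20
h = np.linspace(0.1, 1.0, d)
w_star = rng.normal(size=d)

def noisy_grad(w, sigma=0.1):
    """Exact gradient plus isotropic noise, mimicking minibatch SGD."""
    return h * (w - w_star) + sigma * rng.normal(size=d)

def la_sgd(rounds=200, inner_lr=0.3, local_steps=5, outer_lr=0.5):
    """Single-worker LA: fast weights take a few SGD steps, then the
    slow weights move a fraction of the way toward them."""
    slow = np.zeros(d)
    for _ in range(rounds):
        fast = slow.copy()
        for _ in range(local_steps):
            fast -= inner_lr * noisy_grad(fast)
        slow += outer_lr * (fast - slow)  # outer interpolation step
    return slow

err = float(np.linalg.norm(la_sgd() - w_star))
print(err)
```

The outer interpolation damps the gradient noise accumulated by the fast weights while keeping most of their progress, which is the signal-versus-noise tradeoff the paper contrasts with plain SGD.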