High dimensional theory of two-phase optimizers
arXiv cs.LG / 3/31/2026
Key Points
- The paper studies LA-DiLoCo, a member of the DiLoCo family, analyzing how a single-worker "LA" variant and its multi-worker two-phase form behave on a high-dimensional linear regression task (a minimal sketch of the two-phase update follows this list).
- It finds that the single-worker algorithm trades off signal against noise differently from SGD, a tradeoff that can be advantageous in many settings.
- The multi-worker version injects more noise than the single-worker variant, but the paper shows that suitable hyperparameter choices can mitigate this extra noise.
- It extends the analysis to SLA (LA with momentum) and argues that combining two momentum operators can accelerate convergence by effectively reshaping the Hessian spectrum, with Nesterov momentum performing best (a momentum variant is sketched after the list).
- Overall, the work positions partially asynchronous two-phase optimizers as a promising new paradigm for understanding and improving optimization in larger training setups.
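To make the two-phase structure concrete, the sketch below runs an inner/outer update on a toy high-dimensional linear regression problem: each worker takes several SGD steps from a shared outer iterate, and the outer step moves along the averaged displacement. All names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's exact setup; with `num_workers=1` the round reduces to the single-worker LA variant.

```python
# Toy sketch of a two-phase (inner/outer) optimizer on high-dimensional
# linear regression. Dimensions, batch sizes, and learning rates are
# illustrative, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d = 500                                        # problem dimension
theta_star = rng.normal(size=d) / np.sqrt(d)   # ground-truth regressor

def make_batch(batch_size, noise_std=0.1):
    """Fresh Gaussian design with noisy labels y = X @ theta* + eps."""
    X = rng.normal(size=(batch_size, d))
    y = X @ theta_star + noise_std * rng.normal(size=batch_size)
    return X, y

def sgd_inner_phase(theta, steps, lr, batch_size=32):
    """Phase 1: plain SGD on the least-squares loss from a shared start."""
    theta = theta.copy()
    for _ in range(steps):
        X, y = make_batch(batch_size)
        grad = X.T @ (X @ theta - y) / batch_size
        theta -= lr * grad
    return theta

def two_phase_round(theta_outer, num_workers, inner_steps, inner_lr, outer_lr):
    """Phase 2: average worker displacements (the "outer gradient") and
    take one outer step. num_workers=1 gives the single-worker LA variant."""
    deltas = [sgd_inner_phase(theta_outer, inner_steps, inner_lr) - theta_outer
              for _ in range(num_workers)]
    return theta_outer + outer_lr * np.mean(deltas, axis=0)

theta = np.zeros(d)
for _ in range(50):
    theta = two_phase_round(theta, num_workers=4, inner_steps=20,
                            inner_lr=0.01, outer_lr=0.7)
print("squared parameter error:", float(np.sum((theta - theta_star) ** 2)))
```

Averaging over more workers reduces the variance of each outer step, while `outer_lr` controls how aggressively that step is applied; these are the hyperparameters the paper identifies as levers for taming the multi-worker noise.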
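The SLA variant adds momentum to the outer step. Continuing the sketch above (it reuses `sgd_inner_phase` and the problem setup), here is one common Nesterov-style formulation of that outer update; `mu` and `outer_lr` are again illustrative, not the paper's tuned values.

```python
# SLA-style round: same inner phase as above, but the outer step carries
# Nesterov momentum. Reuses sgd_inner_phase and d from the previous sketch.
import numpy as np

def sla_round(theta_outer, velocity, num_workers, inner_steps,
              inner_lr, outer_lr, mu=0.9):
    deltas = [sgd_inner_phase(theta_outer, inner_steps, inner_lr) - theta_outer
              for _ in range(num_workers)]
    outer_grad = np.mean(deltas, axis=0)
    velocity = mu * velocity + outer_grad        # momentum buffer
    # Nesterov update: step along the fresh outer gradient plus a
    # look-ahead along the velocity.
    theta_outer = theta_outer + outer_lr * (outer_grad + mu * velocity)
    return theta_outer, velocity

theta, v = np.zeros(d), np.zeros(d)
for _ in range(50):
    theta, v = sla_round(theta, v, num_workers=4, inner_steps=20,
                         inner_lr=0.01, outer_lr=0.3)
```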