Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
arXiv cs.LG / March 18, 2026
Key Points
- The paper derives hyperparameter scaling laws for modern first-order optimizers by analyzing convergence bounds within the Linear Minimization Oracle (LMO) framework, covering optimizers like normalized SGD, signSGD, and Muon.
- Treating these convergence bounds as optimization proxies, the authors obtain closed-form power-law schedules for the learning rate, momentum, and batch size as functions of the iteration or token budget.
- With model size fixed, the analysis recovers known insights from the literature under a unified perspective and highlights the interaction between momentum and batch-size scaling.
- The results indicate that several distinct scaling strategies can achieve near-optimal performance, and the paper outlines directions for future research.
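The paper's specific closed-form schedules are not reproduced in this summary; as a minimal sketch of what a power-law hyperparameter scaling rule looks like in practice, the snippet below extrapolates a learning rate across token budgets. The function name and the exponent are hypothetical placeholders, not values derived in the paper.

```python
def power_law_schedule(budget, ref_budget, ref_value, exponent):
    """Scale a hyperparameter as a power law of the training budget.

    Given a reference value tuned at ref_budget, returns the value
    predicted for a new budget under value ~ budget**exponent.
    """
    return ref_value * (budget / ref_budget) ** exponent

# Illustrative use: extrapolate a learning rate tuned at 1B tokens
# to a 10B-token run, assuming lr ~ tokens**(-0.5).
# The -0.5 exponent is made up for illustration.
lr_at_1b = 3e-4
lr_at_10b = power_law_schedule(10e9, 1e9, lr_at_1b, -0.5)
```

The same functional form would apply to momentum or batch size, each with its own exponent; the paper's contribution is deriving those exponents from convergence bounds rather than fitting them empirically.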
