Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

arXiv cs.LG / 4/14/2026


Key Points

  • The paper introduces Muon$^2$, an extension of the Muon optimizer that adds Adam-style adaptive second-moment preconditioning before Muon’s orthogonalization step.
  • It argues that Muon’s slowdown is driven by an ill-conditioned momentum matrix and that Muon$^2$ substantially improves its spectrum, enabling faster convergence to usable orthogonality.
  • Experiments on GPT and LLaMA pre-training (60M–1.3B parameters) show Muon$^2$ consistently outperforms Muon and newer Muon variants while reducing Newton–Schulz iterations by 40%.
  • The work evaluates orthogonalization quality using directional alignment and also proposes Muon$^2$-F, a memory-efficient factorized version that retains most of Muon$^2$’s benefits with minimal extra memory cost.
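The key points mention that orthogonalization quality is measured via "directional alignment." The paper's exact definition is not given here; one plausible illustration is the normalized Frobenius inner product (cosine similarity) between an approximate orthogonalization and the exact polar factor obtained via SVD. The sketch below uses that assumed definition, with function names of our choosing:

```python
import numpy as np

def polar_factor(M):
    # Exact polar factor U V^T via SVD -- the target that
    # Newton-Schulz iterations approximate.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def directional_alignment(X, M):
    # Cosine similarity (normalized Frobenius inner product) between an
    # approximate orthogonalization X and the exact polar factor of M.
    # NOTE: illustrative metric; the paper's definition may differ.
    Q = polar_factor(M)
    return np.sum(X * Q) / (np.linalg.norm(X) * np.linalg.norm(Q))
```

An exact polar factor scores 1.0 under this metric; a truncated Newton–Schulz approximation scores slightly below it.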

Abstract

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton–Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum Muon$^2$ substantially improves, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT and LLaMA pre-training experiments from 60M to 1.3B parameters, Muon$^2$ consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40%. We further introduce Muon$^2$-F, a memory-efficient factorized variant that preserves most of the gains of Muon$^2$ with negligible memory overhead.
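The abstract describes the core mechanism at a high level: accumulate an Adam-style second moment, precondition the momentum matrix with it elementwise, then orthogonalize via Newton–Schulz. A minimal numpy sketch of what one such step might look like, assuming elementwise preconditioning and the simple cubic NS iteration (the paper likely uses a tuned higher-order variant); all names and defaults here are ours, not the paper's:

```python
import numpy as np

def muon2_step(M, v, grad, beta2=0.95, eps=1e-8, ns_iters=5):
    """Sketch of one Muon$^2$-style update direction.

    M    : momentum matrix
    v    : running EMA of squared gradients (second moment)
    grad : current gradient matrix
    """
    # Adam-style second-moment accumulation (elementwise).
    v = beta2 * v + (1 - beta2) * grad**2
    # Precondition the momentum before orthogonalization; per the paper,
    # this improves the spectrum of the otherwise ill-conditioned matrix.
    P = M / (np.sqrt(v) + eps)
    # Cubic Newton-Schulz iteration toward the polar factor of P.
    X = P / (np.linalg.norm(P) + eps)  # scale so singular values <= 1
    for _ in range(ns_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X, v
```

Because the preconditioned matrix is better conditioned, its singular values start closer together, so fewer NS iterations suffice to push them all near 1 — which is the paper's claimed source of the 40% reduction in NS iterations.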