Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

arXiv cs.LG / 4/14/2026


Key Points

  • The paper introduces Muon$^2$, an extension of the Muon optimizer that adds Adam-style adaptive second-moment preconditioning before Muon’s orthogonalization step.
  • It argues that Muon’s slowdown is driven by an ill-conditioned momentum matrix and that Muon$^2$ substantially improves its spectrum, enabling faster convergence to usable orthogonality.
  • Experiments on GPT and LLaMA pre-training (60M–1.3B parameters) show Muon$^2$ consistently outperforms Muon and newer Muon variants while reducing Newton–Schulz iterations by 40%.
  • The work evaluates orthogonalization quality using directional alignment and also proposes Muon$^2$-F, a memory-efficient factorized version that retains most of Muon$^2$’s benefits with minimal extra memory cost.
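The key points mention that orthogonalization quality is measured via "directional alignment." The paper's exact definition is not given here; one plausible illustration is the normalized Frobenius inner product (cosine similarity) between an approximate orthogonalization and the exact polar factor obtained via SVD. The sketch below uses that assumed definition, with function names of our choosing:

```python
import numpy as np

def polar_factor(M):
    # Exact polar factor U V^T via SVD -- the target that
    # Newton-Schulz iterations approximate.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def directional_alignment(X, M):
    # Cosine similarity (normalized Frobenius inner product) between an
    # approximate orthogonalization X and the exact polar factor of M.
    # NOTE: illustrative metric; the paper's definition may differ.
    Q = polar_factor(M)
    return np.sum(X * Q) / (np.linalg.norm(X) * np.linalg.norm(Q))
```

An exact polar factor scores 1.0 under this metric; a truncated Newton–Schulz approximation scores slightly below it.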

Abstract

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton–Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum Muon$^2$ substantially improves, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT and LLaMA pre-training experiments from 60M to 1.3B parameters, Muon$^2$ consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40%. We further introduce Muon$^2$-F, a memory-efficient factorized variant that preserves most of the gains of Muon$^2$ with negligible memory overhead.
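The abstract describes the core mechanism at a high level: accumulate an Adam-style second moment, precondition the momentum matrix with it elementwise, then orthogonalize via Newton–Schulz. A minimal numpy sketch of what one such step might look like, assuming elementwise preconditioning and the simple cubic NS iteration (the paper likely uses a tuned higher-order variant); all names and defaults here are ours, not the paper's:

```python
import numpy as np

def muon2_step(M, v, grad, beta2=0.95, eps=1e-8, ns_iters=5):
    """Sketch of one Muon$^2$-style update direction.

    M    : momentum matrix
    v    : running EMA of squared gradients (second moment)
    grad : current gradient matrix
    """
    # Adam-style second-moment accumulation (elementwise).
    v = beta2 * v + (1 - beta2) * grad**2
    # Precondition the momentum before orthogonalization; per the paper,
    # this improves the spectrum of the otherwise ill-conditioned matrix.
    P = M / (np.sqrt(v) + eps)
    # Cubic Newton-Schulz iteration toward the polar factor of P.
    X = P / (np.linalg.norm(P) + eps)  # scale so singular values <= 1
    for _ in range(ns_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X, v
```

Because the preconditioned matrix is better conditioned, its singular values start closer together, so fewer NS iterations suffice to push them all near 1 — which is the paper's claimed source of the 40% reduction in NS iterations.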