The Newton-Muon Optimizer

arXiv cs.LG / 4/3/2026

Key Points

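To clarify the design principle behind the Muon optimizer's matrix-gradient orthogonalization, the authors propose a surrogate model that approximates the loss as a quadratic function of the perturbation to a weight matrix W, and they derive a new optimization method from it.
The surrogate uses only three matrices: the gradient G, an output-space curvature matrix H, and the data matrix Z that stacks the layer inputs. The resulting update rule has a closed form (up to momentum and weight decay): $W \leftarrow W - \eta \cdot \mathrm{msgn}(G(ZZ^\top)^{-1})$.
The proposed method, Newton-Muon, shows that standard Muon can be interpreted as an implicit Newton-type method that neglects the right preconditioning induced by the input second moment.
In experiments, on a reproduction of the publicly released Modded-NanoGPT configuration that uses Muon for GPT-2 pretraining, Newton-Muon reaches the target validation loss in 6% fewer iterations and reduces wall-clock training time by about 4%.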

Abstract

The Muon optimizer has received considerable attention for its strong performance in training large language models, yet the design principle behind its matrix-gradient orthogonalization remains largely elusive. In this paper, we introduce a surrogate model that not only sheds new light on the design of Muon, but more importantly leads to a new optimizer. In the same spirit as the derivation of Newton's method, the surrogate approximates the loss as a quadratic function of the perturbation to a weight matrix $W$ using only three matrices: the gradient $G$, an output-space curvature matrix $H$, and the data matrix $Z$ that stacks the layer inputs. By minimizing this surrogate in one step and adopting a certain isotropic assumption on the weights, we obtain the closed-form update rule (up to momentum and weight decay) $W \leftarrow W - \eta \cdot \mathrm{msgn}(G(ZZ^\top)^{-1})$, where $\eta$ is the learning rate and $\mathrm{msgn}(X)=UV^\top$ if $X=USV^\top$ is a compact singular value decomposition. This new optimization method, which we refer to as Newton-Muon, shows that standard Muon can be interpreted as an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, on a reproduction of the earliest publicly released Modded-NanoGPT speedrun configuration using Muon for GPT-2 pretraining, Newton-Muon reaches the target validation loss in 6% fewer iteration steps and reduces wall-clock training time by about 4%.
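The update rule in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it omits momentum and weight decay, computes msgn with an explicit SVD, and assumes shapes the abstract does not state (W and G of shape (m, n), and Z of shape (n, N) with one layer input per column, so that ZZ^T is n-by-n). The function names and the `damping` parameter are hypothetical and exist only for this illustration.

```python
import numpy as np


def msgn(X: np.ndarray) -> np.ndarray:
    """Matrix sign msgn(X) = U V^T, where X = U S V^T.

    Uses the thin SVD, which coincides with the compact SVD
    when X has full rank (assumed here).
    """
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt


def newton_muon_step(W, G, Z, lr, damping=0.0):
    """One step W <- W - lr * msgn(G (Z Z^T)^{-1}).

    Assumed shapes: W and G are (m, n); Z is (n, N).
    `damping` is a hypothetical ridge term added to Z Z^T for
    numerical stability; it is not taken from the paper.
    """
    n = Z.shape[0]
    second_moment = Z @ Z.T + damping * np.eye(n)
    # Right-preconditioning: P = G (Z Z^T)^{-1}. Z Z^T is symmetric,
    # so we solve (Z Z^T) P^T = G^T instead of forming an inverse.
    P = np.linalg.solve(second_moment, G.T).T
    return W - lr * msgn(P)


def muon_step(W, G, lr):
    """Plain orthogonalized step W <- W - lr * msgn(G), for comparison."""
    return W - lr * msgn(G)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, N = 4, 6, 32
    W = rng.standard_normal((m, n))
    G = rng.standard_normal((m, n))
    Z = rng.standard_normal((n, N))
    W_new = newton_muon_step(W, G, Z, lr=0.02)
    print("update shape:", W_new.shape)
```

When ZZ^T is a positive multiple of the identity, the preconditioner only rescales G, and msgn is invariant to positive rescaling, so `newton_muon_step` and `muon_step` return the same result. This matches the abstract's reading of standard Muon as the variant that neglects the right preconditioning. Practical Muon implementations typically approximate msgn with a Newton-Schulz iteration rather than an SVD; the SVD is used here only for clarity.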
