MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

arXiv stat.ML / 3/31/2026


Key Points

  • The paper proposes MuonEq, a set of lightweight equilibration schemes for orthogonalized-update optimizers like Muon, applied to the momentum matrix immediately before finite-step Newton–Schulz orthogonalization.
  • MuonEq rebalances the momentum matrix using simple row/column squared-norm statistics via three variants (two-sided RC, row-only R, and column-only C) while requiring only O(m+n) auxiliary state.
  • The authors show the effectiveness of finite-step orthogonalization depends on the input matrix’s spectral properties, especially stable rank and condition number, linking optimization behavior to well-defined linear-algebraic factors.
  • They characterize row/column normalization as a zeroth-order whitening surrogate that mitigates scale mismatch and argue that the row-normalized variant R is the natural default for hidden weight matrices.
  • In LLaMA2 pretraining experiments on C4, the default R variant consistently outperforms baseline Muon at the 130M and 350M scales, with faster convergence and lower validation perplexity.
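
The equilibration step described above can be sketched in a few lines of NumPy. This is an illustrative reading of the key points, not the paper's exact formulation: the function name, the RMS-based row/column scaling, and the epsilon are assumptions.

```python
import numpy as np

def equilibrate(M, mode="R", eps=1e-8):
    """Sketch of MuonEq-style pre-orthogonalization equilibration.

    Rebalances the m x n momentum matrix M using row/column
    squared-norm statistics. The per-row and per-column scales are
    the only auxiliary quantities: O(m + n) numbers in total.
    mode: "R" (row-only), "C" (column-only), or "RC" (two-sided).
    """
    if mode in ("R", "RC"):
        # Root-mean-square entry scale of each row, shape (m, 1).
        r = np.sqrt(np.mean(M**2, axis=1, keepdims=True))
        M = M / (r + eps)  # each row now has RMS ~ 1
    if mode in ("C", "RC"):
        # Root-mean-square entry scale of each column, shape (1, n).
        c = np.sqrt(np.mean(M**2, axis=0, keepdims=True))
        M = M / (c + eps)
    return M
```

The output would then be handed to the finite-step Newton–Schulz orthogonalization; the point of the rebalancing is that rows (or columns) with wildly different scales no longer distort the spectrum the iteration sees.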

Abstract

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton–Schulz using row/column squared-norm statistics and only O(m+n) auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant R is the natural default and preserves the Õ(T^{-1/4}) stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.
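
For context on the orthogonalization step the abstract refers to, here is the classical cubic Newton–Schulz iteration, which drives the singular values of a matrix toward 1 over a fixed number of steps. This is a sketch for illustration: Muon itself uses a tuned quintic polynomial, and the step count and Frobenius pre-normalization here are generic choices, not the paper's.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Classical cubic Newton–Schulz orthogonalization (illustrative).

    Each step applies s -> 1.5*s - 0.5*s^3 to every singular value of X,
    pushing them toward 1 while leaving the singular vectors unchanged.
    Pre-normalizing by the Frobenius norm keeps the spectral norm <= 1,
    which places all singular values inside the basin of convergence.
    """
    X = G / (np.linalg.norm(G) + 1e-8)  # Frobenius norm, so ||X||_2 <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X
```

The abstract's point is that how well a *finite* number of such steps works depends on the spectrum of the input (stable rank, condition number): small singular values take many steps to reach 1, which is exactly the mismatch the pre-orthogonalization equilibration is meant to reduce.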