On the Convergence Analysis of Muon

arXiv stat.ML / April 15, 2026


Key Points

  • The paper addresses a key gap in understanding Muon, an optimizer designed for neural network parameters with matrix structure rather than treating them as flattened vectors.
  • It provides a comprehensive convergence rate analysis of Muon and compares it to standard Gradient Descent (GD).
  • The authors derive conditions under which Muon is theoretically expected to outperform GD during training.
  • The analysis suggests Muon gains an advantage from the low-rank structure of Hessian matrices, which the authors note is common in real neural network optimization.
  • Experiments corroborate the theoretical claims about convergence rates and performance benefits.
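To make the "matrix structure" point concrete, here is a minimal sketch of an idealized Muon update for a single matrix parameter. The idea is that the momentum matrix is replaced by its orthogonalized form `U @ V^T` (from the SVD `M = U S V^T`) before the step is applied; the function name and hyperparameter values are illustrative, not from the paper, and an exact SVD is used for clarity:

```python
import numpy as np

def muon_step(W, G, M, lr=0.02, momentum=0.95):
    """One idealized Muon update for a matrix parameter W.

    Muon keeps a momentum buffer M and applies the orthogonalized
    momentum U @ V^T (from the SVD M = U S V^T) instead of M itself,
    so every singular direction of the update has unit magnitude.
    Practical implementations approximate U @ V^T iteratively; the
    exact SVD is used here for clarity. Hyperparameters illustrative.
    """
    M = momentum * M + G                              # momentum accumulation
    U, _, Vt = np.linalg.svd(M, full_matrices=False)  # M = U S V^T
    O = U @ Vt                                        # orthogonalized direction
    W = W - lr * O
    return W, M
```

Note how this differs from GD: the step depends only on the singular vectors of the (momentum-averaged) gradient, not its singular values, which is where the interaction with Hessian structure enters the analysis.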

Abstract

The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and compare it with Gradient Descent (GD). We characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training. Our experimental results corroborate the theoretical findings.
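In practice, computing a full SVD every step is expensive, so public Muon implementations approximate the orthogonalization `U @ V^T` with a few Newton-Schulz iterations. The sketch below assumes the odd-quintic variant with coefficients taken from public Muon implementations (not from this paper); after Frobenius-norm normalization, a handful of iterations drives all singular values toward 1:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate U @ V^T for G = U S V^T without an explicit SVD.

    An odd quintic iteration X <- a*X + (b*A + c*A@A) @ X with
    A = X @ X.T acts on each singular value s as s <- a*s + b*s^3 + c*s^5,
    pushing s toward 1. Coefficients follow public Muon implementations
    and are illustrative here.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # keep A = X @ X.T small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

Because the iteration is an odd polynomial in `X`, it leaves the singular vectors untouched and only reshapes the singular values, so the result approximates the same orthogonalized direction the idealized SVD-based update would produce.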