[D] How come Muon is only being used for Transformers?

Reddit r/MachineLearning / 3/31/2026


Key Points

  • The post notes that Muon is being adopted for LLM/Transformer training, but it is rarely discussed or used for other model types like ConvNets despite being announced with a CIFAR-10 training speed record.
  • It raises the question of why Muon appears transformer-centric, including whether the approach fails to scale beyond that domain or if related research has been overlooked.
  • The author links the issue to the general expectation that faster training methods often correlate with improved final model quality, making the lack of broader usage noteworthy.
  • Overall, the content frames Muon’s current usage pattern as an open research/engineering signal rather than settled best practice.

Muon has quickly been adopted in LLM training, yet we don't see it being talked about in other contexts. Searches for Muon on ConvNets turn up basically no results, despite its announcement including a new training speed record for CIFAR-10. In my experience, faster training usually comes with better final models, so what's the deal? Does it not actually scale? Have I missed papers?
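For context, one plausible reason for the transformer-centric usage is that Muon's update is defined on 2D weight matrices: it orthogonalizes the momentum buffer (typically via a Newton-Schulz iteration) before applying it, so it maps naturally onto transformer linear layers, while conv kernels need reshaping and non-matrix parameters fall back to another optimizer. Below is a minimal sketch of that update; the quintic coefficients and hyperparameters follow values reported for the public reference implementation but should be treated as assumptions, not a definitive spec:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix with a quintic Newton-Schulz
    iteration. Coefficients are those reported for the public Muon
    reference implementation (assumption; exact values may differ)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so the spectral norm is <= 1 and the iteration converges.
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=0.02, momentum=0.95):
    """One Muon update for a 2D weight matrix W, given its gradient G
    and momentum buffer M. Returns the updated (W, M)."""
    M = momentum * M + G                          # accumulate momentum
    W = W - lr * newton_schulz_orthogonalize(M)   # apply orthogonalized step
    return W, M
```

The iteration pushes all singular values of the momentum matrix toward 1 (only approximately, by design), so every direction in the weight matrix gets a similarly sized step. That whole construction assumes a matrix-shaped parameter, which is part of why applying it to ConvNets requires extra decisions (e.g. flattening 4D kernels) that the transformer case avoids.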

submitted by /u/lukeiy