StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

arXiv cs.AI / 4/20/2026


Key Points

  • The paper identifies a key limitation of SignSGD: its tendency to diverge on the non-smooth objectives that are common in modern ML due to ReLU activations, max-pooling, and mixture-of-experts routing.
  • It proposes StoSignSGD, which injects structural stochasticity into the sign operator while keeping the gradient update step unbiased.
  • In (online) convex settings, the authors prove StoSignSGD resolves SignSGD’s non-convergence with a sharp convergence rate that matches known lower bounds.
  • For non-convex, non-smooth optimization, they introduce generalized stationary measures and show improved complexity bounds (including dimensional factor improvements).
  • Empirically, StoSignSGD is reported to be more stable and efficient across LLM training regimes: it remains stable in low-precision FP8 pretraining, where AdamW fails, and improves fine-tuning performance of 7B models on mathematical reasoning tasks.
  • The work also develops a framework for converting general optimizers into unbiased sign-based counterparts, together with extensive ablations.
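The paper does not reproduce its exact operator in this summary, but the core idea of an unbiased sign step can be illustrated with a standard construction (an assumption for illustration, not the paper's definition): map each coordinate to ±1 with probabilities chosen so that the expectation equals the gradient divided by a per-coordinate bound.

```python
import numpy as np

def stochastic_sign(g, bound, rng=None):
    """Map each coordinate of g to +1 or -1 such that E[output] = g / bound.

    Uses P(+1) = (1 + g/bound) / 2 and P(-1) = (1 - g/bound) / 2,
    assuming |g| <= bound coordinate-wise, so that
    E[output] = 2 * P(+1) - 1 = g / bound.
    This is an illustrative construction, not the paper's exact operator.
    """
    rng = np.random.default_rng() if rng is None else rng
    p_plus = 0.5 * (1.0 + np.asarray(g) / bound)
    u = rng.random(np.shape(g))
    return np.where(u < p_plus, 1.0, -1.0)
```

Because E[bound · stochastic_sign(g)] = g coordinate-wise, an update built this way is unbiased, unlike the deterministic sign(g), which is consistent with the paper's stated motivation for injecting stochasticity.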

Abstract

Sign-based optimization algorithms such as SignSGD have garnered significant attention for their remarkable performance in distributed learning and in training large foundation models. Despite this empirical success, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLU activations, max-pooling, and mixture-of-experts routing. To overcome this fundamental limitation, we propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate that matches the lower bound. For the more challenging non-convex, non-smooth setting, we introduce generalized stationary measures that encompass prior definitions and prove that StoSignSGD improves upon the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD exhibits robust stability and superior efficiency across diverse large language model (LLM) training regimes. Notably, in low-precision FP8 pretraining (a setting where AdamW fails catastrophically), StoSignSGD remains highly stable and yields a 1.44× to 2.14× speedup relative to established baselines. Furthermore, when fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD delivers substantial performance gains over both AdamW and SignSGD. Finally, to dissect the mechanisms driving its success, we develop a sign conversion framework capable of transforming any general optimizer into its unbiased, sign-based counterpart. Using this framework, we deconstruct the core components of StoSignSGD and present a comprehensive ablation study that empirically validates our algorithmic design choices.
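The abstract's sign conversion framework is not specified in this summary; the following hypothetical sketch shows what such a conversion could look like, using a momentum buffer to stand in for an arbitrary inner optimizer's update direction. The class name, the clipping step, and all hyperparameters are illustrative assumptions, not the paper's construction.

```python
import numpy as np

class UnbiasedSignWrapper:
    """Hypothetical sketch: wrap an inner direction (here, momentum)
    with an unbiased stochastic sign step. Not the paper's algorithm."""

    def __init__(self, lr=0.1, bound=1.0, beta=0.9, seed=0):
        self.lr, self.bound, self.beta = lr, bound, beta
        self.m = None  # momentum buffer standing in for the inner optimizer
        self.rng = np.random.default_rng(seed)

    def step(self, params, grad):
        # Inner optimizer direction: exponential moving average of gradients.
        self.m = grad if self.m is None else self.beta * self.m + (1 - self.beta) * grad
        # Clip so the sign probabilities below stay in [0, 1] (a simplification).
        d = np.clip(self.m, -self.bound, self.bound)
        # Unbiased stochastic sign: E[s] = d / bound coordinate-wise.
        p_plus = 0.5 * (1.0 + d / self.bound)
        s = np.where(self.rng.random(d.shape) < p_plus, 1.0, -1.0)
        return params - self.lr * s
```

When a coordinate saturates the bound (|d| = bound), the probability collapses to 0 or 1 and the deterministic sign is recovered; in between, randomness restores unbiasedness of the direction in expectation.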