Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy

arXiv cs.LG / 4/29/2026

Key Points

  • The paper revisits SignSGD through a 1-bit quantization and dithering lens, targeting the known generalization gap relative to well-tuned SGD that stems from discarding gradient magnitude information.
  • It derives a small-batch convergence rate for SignSGD under unimodal symmetric gradient noise, using a signal-to-noise weighted stationarity metric and removing prior large-batch assumptions.
  • The authors improve performance by injecting annealed Gaussian noise before the sign operator (classical dithering), which probabilistically recovers some magnitude information lost to hard thresholding.
  • They adapt SWATS to sign-based updates using a projection-based learning-rate calibration that smoothly transitions from SignSGD behavior toward SGD.
  • Experiments with ResNet-18 on a single worker isolate optimizer effects from communication costs: pre-sign dithering beats Adam on CIFAR-100, and the calibrated switching strategy reaches 92.18% test accuracy on CIFAR-10, outperforming both pure SGD and pure SignSGD with momentum.

Abstract

SignSGD compresses each stochastic gradient coordinate to a single bit, offering substantial memory and communication savings, but its 1-bit quantization removes magnitude information and is known to leave a generalization gap relative to well-tuned SGD. We revisit SignSGD from a 1-bit quantization and dithering perspective and contribute three improvements. First, we derive a small-batch convergence rate for SignSGD under unimodal symmetric gradient noise using a signal-to-noise weighted stationarity measure, removing the large-batch assumption of prior analyses. Second, we inject annealed Gaussian noise before the sign operator, which acts as a classical dithering mechanism and probabilistically restores magnitude information lost to hard thresholding. Third, we adapt the SWATS strategy to sign-based updates with a projection-based learning-rate calibration that smoothly transitions from SignSGD to SGD. Single-worker experiments on ResNet-18 isolate optimizer effects from communication costs: pre-sign dithering surpasses Adam on CIFAR-100, and the calibrated switch reaches 92.18% test accuracy on CIFAR-10, outperforming both pure SGD (91.38%) and pure SignSGD with momentum (90.82%).
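
For concreteness, here is a minimal sketch of the baseline SignSGD update the paper starts from: each gradient coordinate is compressed to its sign, so the step carries direction but no magnitude. Function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def signsgd_step(params, grad, lr):
    """One SignSGD update: move each coordinate by lr against the
    sign of its stochastic gradient. All magnitude information in
    grad is discarded by the 1-bit compression."""
    return params - lr * np.sign(grad)

# Example: a single step on a toy parameter vector.
rng = np.random.default_rng(0)
w = rng.standard_normal(10)
g = rng.standard_normal(10)   # stand-in for a stochastic gradient
w = signsgd_step(w, g, lr=1e-3)
```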
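The convergence result is stated in terms of a signal-to-noise weighted stationarity measure; the paper's exact definition is not reproduced in this summary. As a hedged illustration in the spirit of earlier unimodal-symmetric analyses of SignSGD (e.g., Bernstein et al.), such a measure weights each coordinate's gradient magnitude by how reliably its sign survives the noise:

```latex
% Illustrative SNR-weighted stationarity measure (an assumption for
% exposition; the paper's exact definition may differ).
\[
  S_i(x) = \frac{|g_i(x)|}{\sigma_i}, \qquad
  \rho(x) = \sum_{i=1}^{d} |g_i(x)| \cdot
            \min\!\Bigl(1, \tfrac{S_i(x)}{\sqrt{3}}\Bigr).
\]
% High-SNR coordinates contribute their full magnitude |g_i(x)|
% (an l1-type term); low-SNR coordinates are down-weighted toward
% g_i(x)^2 / (sqrt(3) * sigma_i), reflecting that their sign bit
% is nearly a coin flip under heavy noise.
```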
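The dithering idea can be made concrete in a few lines of NumPy. With zero-mean Gaussian noise of scale s added before the sign, the expected 1-bit code is E[sign(g + n)] = erf(g / (s√2)), a monotone function of g, so the dithered sign carries magnitude information in expectation; annealing s toward zero recovers plain SignSGD. The schedule below is a hypothetical example, not the paper's.

```python
import numpy as np

def dithered_sign_step(params, grad, lr, noise_scale, rng):
    """SignSGD step with pre-sign Gaussian dithering.

    Adding zero-mean noise before the hard sign makes the expected
    1-bit code E[sign(g + n)] = erf(g / (noise_scale * sqrt(2))),
    monotone in g, so magnitude information is restored
    probabilistically rather than discarded outright."""
    dithered = grad + noise_scale * rng.standard_normal(grad.shape)
    return params - lr * np.sign(dithered)

# Hypothetical annealing schedule: shrink the dither over training
# so the update approaches plain SignSGD.
def annealed_scale(step, s0=1e-3, decay=0.999):
    return s0 * decay**step
```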
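Finally, a sketch of how a SWATS-style projection calibration can carry over to sign updates. In SWATS (Keskar & Socher, 2017), the SGD learning rate is estimated by requiring that the orthogonal projection of the candidate SGD step onto the optimizer's actual step reproduce that step; applying the same rule to a sign step p = -lr * sign(g) gives gamma = -||p||^2 / (g . p) = lr * d / ||g||_1 when no coordinate is exactly zero. The EMA smoothing and switch test below are assumptions about how such a calibration might be monitored, not the paper's exact criterion.

```python
import numpy as np

def calibrated_sgd_lr(grad, sign_lr):
    """SWATS-style calibration adapted to a sign step
    p = -sign_lr * sign(grad): choose gamma so that the projection
    of -gamma * grad onto p equals p, i.e.
    gamma = -||p||^2 / (grad . p)."""
    p = -sign_lr * np.sign(grad)
    denom = grad @ p              # negative for a descent direction
    return -(p @ p) / denom

# Hypothetical monitoring rule: smooth gamma with a bias-corrected
# EMA and switch to SGD once the estimate stabilizes.
def should_switch(gamma_hist, beta=0.9, tol=1e-3):
    ema = 0.0
    for k, g in enumerate(gamma_hist, start=1):
        ema = beta * ema + (1 - beta) * g
    corrected = ema / (1 - beta**k)
    return abs(corrected - gamma_hist[-1]) < tol, corrected
```

The appeal of the projection rule is that the switch point and the post-switch SGD learning rate come from the optimizer's own trajectory rather than from an extra tuned hyperparameter.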