Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy

arXiv cs.LG / 4/29/2026

Key Points

  • The paper revisits SignSGD through a 1-bit quantization and dithering lens, targeting the known generalization gap relative to well-tuned SGD that stems from discarding gradient magnitude information.
  • It derives a small-batch convergence rate for SignSGD under unimodal symmetric gradient noise, using a signal-to-noise weighted stationarity metric and removing prior large-batch assumptions.
  • The authors improve performance by injecting annealed Gaussian noise before the sign operator (classical dithering), which probabilistically recovers some magnitude information lost to hard thresholding.
  • They adapt SWATS to sign-based updates using a projection-based learning-rate calibration that smoothly transitions from SignSGD behavior toward SGD.
  • Experiments with ResNet-18 on a single worker isolate optimizer effects from communication costs: pre-sign dithering beats Adam on CIFAR-100, and the calibrated switching strategy reaches 92.18% test accuracy on CIFAR-10, outperforming both pure SGD and pure SignSGD with momentum.

Abstract

SignSGD compresses each stochastic gradient coordinate to a single bit, offering substantial memory and communication savings, but its 1-bit quantization removes magnitude information and is known to leave a generalization gap relative to well-tuned SGD. We revisit SignSGD from a 1-bit quantization and dithering perspective and contribute three improvements. First, we derive a small-batch convergence rate for SignSGD under unimodal symmetric gradient noise using a signal-to-noise weighted stationarity measure, removing the large-batch assumption of prior analyses. Second, we inject annealed Gaussian noise before the sign operator, which acts as a classical dithering mechanism and probabilistically restores magnitude information lost to hard thresholding. Third, we adapt the SWATS strategy to sign-based updates with a projection-based learning-rate calibration that smoothly transitions from SignSGD to SGD. Single-worker experiments on ResNet-18 isolate optimizer effects from communication costs: pre-sign dithering surpasses Adam on CIFAR-100, and the calibrated switch reaches 92.18% test accuracy on CIFAR-10, outperforming both pure SGD (91.38%) and pure SignSGD with momentum (90.82%).
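
For concreteness, here is a minimal sketch of the baseline SignSGD update the paper starts from: each gradient coordinate is compressed to its sign, so the step carries direction but no magnitude. Function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def signsgd_step(params, grad, lr):
    """One SignSGD update: move each coordinate by lr against the
    sign of its stochastic gradient. All magnitude information in
    grad is discarded by the 1-bit compression."""
    return params - lr * np.sign(grad)

# Example: a single step on a toy parameter vector.
rng = np.random.default_rng(0)
w = rng.standard_normal(10)
g = rng.standard_normal(10)   # stand-in for a stochastic gradient
w = signsgd_step(w, g, lr=1e-3)
```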
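The convergence result is stated in terms of a signal-to-noise weighted stationarity measure; the paper's exact definition is not reproduced in this summary. As a hedged illustration in the spirit of earlier unimodal-symmetric analyses of SignSGD (e.g., Bernstein et al.), such a measure weights each coordinate's gradient magnitude by how reliably its sign survives the noise:

```latex
% Illustrative SNR-weighted stationarity measure (an assumption for
% exposition; the paper's exact definition may differ).
\[
  S_i(x) = \frac{|g_i(x)|}{\sigma_i}, \qquad
  \rho(x) = \sum_{i=1}^{d} |g_i(x)| \cdot
            \min\!\Bigl(1, \tfrac{S_i(x)}{\sqrt{3}}\Bigr).
\]
% High-SNR coordinates contribute their full magnitude |g_i(x)|
% (an l1-type term); low-SNR coordinates are down-weighted toward
% g_i(x)^2 / (sqrt(3) * sigma_i), reflecting that their sign bit
% is nearly a coin flip under heavy noise.
```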
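The dithering idea can be made concrete in a few lines of NumPy. With zero-mean Gaussian noise of scale s added before the sign, the expected 1-bit code is E[sign(g + n)] = erf(g / (s√2)), a monotone function of g, so the dithered sign carries magnitude information in expectation; annealing s toward zero recovers plain SignSGD. The schedule below is a hypothetical example, not the paper's.

```python
import numpy as np

def dithered_sign_step(params, grad, lr, noise_scale, rng):
    """SignSGD step with pre-sign Gaussian dithering.

    Adding zero-mean noise before the hard sign makes the expected
    1-bit code E[sign(g + n)] = erf(g / (noise_scale * sqrt(2))),
    monotone in g, so magnitude information is restored
    probabilistically rather than discarded outright."""
    dithered = grad + noise_scale * rng.standard_normal(grad.shape)
    return params - lr * np.sign(dithered)

# Hypothetical annealing schedule: shrink the dither over training
# so the update approaches plain SignSGD.
def annealed_scale(step, s0=1e-3, decay=0.999):
    return s0 * decay**step
```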
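Finally, a sketch of how a SWATS-style projection calibration can carry over to sign updates. In SWATS (Keskar & Socher, 2017), the SGD learning rate is estimated by requiring that the orthogonal projection of the candidate SGD step onto the optimizer's actual step reproduce that step; applying the same rule to a sign step p = -lr * sign(g) gives gamma = -||p||^2 / (g . p) = lr * d / ||g||_1 when no coordinate is exactly zero. The EMA smoothing and switch test below are assumptions about how such a calibration might be monitored, not the paper's exact criterion.

```python
import numpy as np

def calibrated_sgd_lr(grad, sign_lr):
    """SWATS-style calibration adapted to a sign step
    p = -sign_lr * sign(grad): choose gamma so that the projection
    of -gamma * grad onto p equals p, i.e.
    gamma = -||p||^2 / (grad . p)."""
    p = -sign_lr * np.sign(grad)
    denom = grad @ p              # negative for a descent direction
    return -(p @ p) / denom

# Hypothetical monitoring rule: smooth gamma with a bias-corrected
# EMA and switch to SGD once the estimate stabilizes.
def should_switch(gamma_hist, beta=0.9, tol=1e-3):
    ema = 0.0
    for k, g in enumerate(gamma_hist, start=1):
        ema = beta * ema + (1 - beta) * g
    corrected = ema / (1 - beta**k)
    return abs(corrected - gamma_hist[-1]) < tol, corrected
```

The appeal of the projection rule is that the switch point and the post-switch SGD learning rate come from the optimizer's own trajectory rather than from an extra tuned hyperparameter.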