Zeroth-Order Optimization at the Edge of Stability

arXiv cs.LG / 4/17/2026

Key Points

  • The paper studies zeroth-order (ZO) optimization methods using a two-point gradient-free estimator and derives an explicit step-size condition for mean-square linear stability.
  • It shows a key difference from first-order (FO) optimization: FO stability depends mainly on the largest Hessian eigenvalue, while ZO stability is influenced by the entire Hessian spectrum.
  • Because full Hessian eigenspectrum computation is impractical, the authors provide practical stability bounds that require only the largest eigenvalue and the Hessian trace.
  • Experiments indicate that several full-batch ZO methods (ZO-GD, ZO-GDM, and ZO-Adam) tend to run near the predicted “edge of stability” boundary across multiple deep-learning training tasks.
  • The findings suggest ZO methods have an implicit regularization effect where large step sizes mainly regularize the Hessian trace (unlike FO methods, which target the top eigenvalue).
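To make the setup concrete, here is a minimal sketch of the standard two-point gradient-free estimator driving a ZO-GD loop on a toy quadratic. The function names, the diagonal test Hessian, and the step size are illustrative choices, not the paper's experimental setup; the estimator formula itself is the standard two-point form the paper analyzes.

```python
import numpy as np

def two_point_grad(f, x, mu=1e-3, rng=None):
    """Standard two-point ZO estimator:
    g_hat = ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u,
    with u a random Gaussian direction. Unbiased for the smoothed
    objective; only two function evaluations, no gradients."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

# ZO-GD on a toy quadratic f(x) = 0.5 * x^T H x, where the full
# Hessian spectrum (not just its top eigenvalue) enters stability.
rng = np.random.default_rng(0)
H = np.diag([1.0, 0.5, 0.1])       # illustrative spectrum
f = lambda x: 0.5 * x @ H @ x

x = np.ones(3)
eta = 0.1                          # step size; too large -> mean-square blow-up
for _ in range(500):
    x = x - eta * two_point_grad(f, x, rng=rng)
```

On a quadratic the estimator is exact along the sampled direction (`f(x+mu*u) - f(x-mu*u) = 2*mu * u @ H @ x`), so the update is `x_{k+1} = (I - eta * u u^T H) x_k`: its mean follows plain GD, while its second moment mixes all Hessian eigenvalues, which is the spectrum dependence the paper characterizes.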

Abstract

Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
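The paper's tractable bounds need only two spectral quantities: the largest Hessian eigenvalue and the Hessian trace. Both are standard to estimate from Hessian-vector products alone, via power iteration and the Hutchinson estimator. The sketch below illustrates those two estimators on a random symmetric stand-in for the Hessian; the specific stability-bound formula from the paper is not reproduced here, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
A = rng.standard_normal((d, d))
H = A @ A.T / d                      # symmetric PSD stand-in for a Hessian
hvp = lambda v: H @ v                # in practice: a Hessian-vector product oracle

def lambda_max_power(hvp, d, iters=300):
    """Largest eigenvalue via power iteration on Hessian-vector products."""
    v = np.ones(d) / np.sqrt(d)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return v @ hvp(v)                # Rayleigh quotient at the converged direction

def trace_hutchinson(hvp, d, samples=2000, rng=None):
    """Hutchinson estimator: E[z^T H z] = tr(H) for Rademacher z."""
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(samples):
        z = rng.choice([-1.0, 1.0], size=d)
        total += z @ hvp(z)
    return total / samples

lam_max = lambda_max_power(hvp, d)
tr_H = trace_hutchinson(hvp, d, rng=rng)
```

Neither routine ever materializes the full spectrum, which is the point: a practical bound phrased in `lam_max` and `tr_H` is checkable during deep-network training, where a full eigendecomposition is not.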