Anon: Extrapolating Optimizer Adaptivity Across the Real Spectrum

arXiv cs.AI / 5/5/2026


Key Points

  • The paper attributes the generalization gap between adaptive optimizers (e.g., Adam) and non-adaptive methods (e.g., SGD) to the restricted adaptivity of their pre-conditioners, which limits the optimizer’s ability to handle diverse optimization landscapes.
  • It introduces Anon, an optimizer with continuously tunable adaptivity that can interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond them (a rough sketch of the idea follows this list).
  • To maintain convergence across the full range of adaptivity, the authors propose Incremental Delay Update (IDU), which they claim is more flexible than AMSGrad’s hard max-tracking and more robust to gradient noise.
  • The work provides theoretical convergence guarantees in both convex and non-convex settings and reports empirical improvements over state-of-the-art optimizers on image classification, diffusion, and language modeling.
  • Overall, the authors argue that adaptivity can be treated as a tunable design principle, offering a unified framework connecting classical and modern optimization behaviors.
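
To make tunable adaptivity concrete, here is a minimal sketch (in Python, with function and parameter names of our own choosing, not taken from the paper): an Adam-style update in which an exponent p on the second-moment estimate controls how adaptive the pre-conditioner is. p = 0 behaves like SGD with momentum, p = 0.5 behaves like Adam (bias correction omitted for brevity), and other real values interpolate or extrapolate; the actual Anon update rule may differ.

```python
def tunable_adaptivity_step(param, grad, state, lr=1e-3,
                            beta1=0.9, beta2=0.999, p=0.5, eps=1e-8):
    """One scalar update with a tunable adaptivity exponent p.

    p = 0.0  -> pre-conditioner is ~1, i.e. SGD-with-momentum-like behavior
    p = 0.5  -> inverse-sqrt pre-conditioner, i.e. Adam-like behavior
    other p  -> interpolation between, or extrapolation beyond, the two
    NOTE: illustrative sketch only, not the Anon update rule from the paper.
    """
    m = beta1 * state.get("m", 0.0) + (1.0 - beta1) * grad         # first moment
    v = beta2 * state.get("v", 0.0) + (1.0 - beta2) * grad * grad  # second moment
    state["m"], state["v"] = m, v
    denom = (v + eps) ** p   # the exponent p sets the strength of adaptivity
    return param - lr * m / denom


# Example: minimizing f(x) = (x - 3)^2 with different adaptivity exponents.
for p in (0.0, 0.25, 0.5):
    x, state = 0.0, {}
    for _ in range(2000):
        grad = 2.0 * (x - 3.0)
        x = tunable_adaptivity_step(x, grad, state, lr=0.01, p=p)
    print(f"p={p}: x = {x:.4f}")
```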

Abstract

Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD, on classical architectures like CNNs. We identify a key cause of this performance gap: the restricted adaptivity of pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity over the entire real line, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce Incremental Delay Update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees in both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and that Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers while surpassing the advantageous properties of both.
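
For context on the convergence mechanism the abstract contrasts with, the snippet below shows AMSGrad's hard max-tracking, in which the tracked second moment can only grow, next to a hypothetical "softened" tracker whose cap is allowed to decay. The softened variant is purely our own illustration of why a less rigid rule can be more robust to one-off gradient spikes; it is not the paper's Incremental Delay Update (IDU), whose actual mechanism is defined in the paper.

```python
def amsgrad_track(v_hat_prev, v_t):
    """AMSGrad's hard max-tracking: the tracked second moment never decreases,
    so a single unusually large (noisy) gradient inflates the pre-conditioner
    permanently and keeps later effective step sizes small."""
    return max(v_hat_prev, v_t)


def softened_track(v_hat_prev, v_t, decay=0.9):
    """Hypothetical softened tracker, shown only for contrast with hard
    max-tracking; it is NOT the paper's Incremental Delay Update (IDU).
    The cap decays over time, so a transient spike is eventually forgotten."""
    return max(decay * v_hat_prev, v_t)


# Demo: a one-off gradient spike at step 10.
v = v_hard = v_soft = 0.0
for t in range(1, 61):
    g = 10.0 if t == 10 else 0.1       # mostly small gradients, one noisy spike
    v = 0.9 * v + 0.1 * g * g          # Adam-style second-moment estimate
    v_hard = amsgrad_track(v_hard, v)
    v_soft = softened_track(v_soft, v)
print(f"hard cap: {v_hard:.3f}  soft cap: {v_soft:.3f}  current v: {v:.3f}")
# The hard cap stays near the spike's peak; the soft cap relaxes back toward v.
```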