ANO: A Principled Approach to Robust Policy Optimization

arXiv cs.AI / 5/5/2026


Key Points

  • The paper argues that PPO’s hard clipping discards useful gradient information from outliers (the clipped objective is sketched just after this list), while removing clipping (as in SPO) can lead to unbounded gradients and severe instability.
  • It introduces a Unified Trust Region Framework and derives Anchored Neighborhood Optimization (ANO) from explicit design principles.
  • ANO is motivated by a “Redescending Influence Principle,” which replaces monotonic penalties (as in SPO) and hard thresholding (as in PPO) with dynamic suppression of outliers to improve stability under high-variance stochastic optimization.
  • The authors prove that ANO possesses the minimal structural complexity needed for robust optimization, and that the proposed principle is necessary for stability.
  • Experiments on MuJoCo benchmarks show ANO achieving state-of-the-art performance over PPO and SPO, including substantially better stability: ANO avoids policy collapse even under aggressive hyperparameters (e.g., learning rates 3x larger than standard) where PPO fails completely.
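
For reference, the “hard clipping” in the first point is PPO’s standard clipped surrogate objective (textbook PPO background, not this paper’s own notation). Whenever the min selects the clipped term, the sample contributes zero gradient, which is precisely the discarded outlier information the paper targets:

\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}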

Abstract

Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gradients, causing significant instability and hyperparameter sensitivity. To resolve this, we establish a Unified Trust Region Framework that generalizes existing objectives. Within this framework, we derive Anchored Neighborhood Optimization (ANO) based on a set of design principles. We identify that the failure of standard policy gradients stems from a misapplication of gradient influence on outliers. We propose the Redescending Influence Principle, a paradigm shift from monotonic penalties (SPO) and hard-thresholding (PPO) to dynamic outlier suppression, and prove its necessity for stability in high-variance stochastic optimization. Theoretically, we prove ANO possesses the minimal structural complexity required for robust optimization. Empirically, ANO achieves state-of-the-art performance on MuJoCo benchmarks, significantly outperforming PPO and SPO. Notably, ANO demonstrates superior stability, preventing policy collapse even under aggressive hyperparameters (e.g., learning rates 3x larger than standard) where PPO fails completely.
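
To make the three influence shapes in the abstract concrete, the sketch below contrasts hard thresholding (PPO-style: a sample’s influence is cut to zero outside the trust region), a monotonic penalty (SPO-style: influence grows without bound), and a redescending shape that rises near the anchor ratio r = 1 and then smoothly decays for large deviations. The specific redescending form here (a Welsch-style Gaussian taper, with the function names and the scale parameter chosen for illustration) is a hypothetical stand-in, not the ANO objective from the paper.

import math

def hard_threshold_influence(r: float, eps: float = 0.2) -> float:
    """PPO-style hard thresholding: a sample's influence is discarded
    entirely once the probability ratio r leaves [1 - eps, 1 + eps]."""
    d = r - 1.0
    return d if abs(d) <= eps else 0.0

def monotonic_influence(r: float) -> float:
    """SPO-style monotonic penalty: influence keeps growing with the
    deviation, so a single extreme outlier can dominate the update."""
    return r - 1.0

def redescending_influence(r: float, scale: float = 0.2) -> float:
    """Illustrative redescending shape (hypothetical, not the paper's
    ANO objective): influence peaks near the anchor r = 1, then is
    smoothly suppressed back toward zero for extreme outliers."""
    d = r - 1.0
    return d * math.exp(-((d / scale) ** 2))

# Compare the three shapes as the ratio drifts away from the anchor.
for r in (0.5, 0.9, 1.0, 1.1, 1.3, 2.0):
    print(f"r={r:4.2f}  hard={hard_threshold_influence(r):+6.3f}  "
          f"mono={monotonic_influence(r):+6.3f}  "
          f"redesc={redescending_influence(r):+6.3f}")

Note how at r = 2.0 the monotonic penalty still pushes hard while the redescending weight has decayed to nearly zero: on the abstract’s account, it is this dynamic suppression, rather than PPO’s abrupt cutoff, that keeps high-variance updates stable.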