MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

arXiv cs.LG / 3/19/2026


Key Points

  • MHPO proposes a Modulated Hazard-aware Policy Optimization framework to boost stability in GRPO-based reinforcement learning by addressing non-differentiable ratio clipping and gradient fidelity issues.
  • It introduces a Log-Fidelity Modulator (LFM) that maps unbounded importance ratios into a bounded, differentiable domain to limit the impact of high-variance outliers on the loss landscape.
  • It further adds a Decoupled Hazard Penalty (DHP) that uses cumulative hazard functions to independently regulate positive and negative policy shifts, reducing mode collapse and policy erosion within a stabilized trust region.
  • The approach is evaluated on diverse reasoning benchmarks across text-based and vision-language tasks, where MHPO outperforms existing methods and improves training stability.
  • Overall, MHPO provides finer-grained regulation of policy updates, enabling more robust and reliable reinforcement learning training.
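The paper's exact formulation of the Log-Fidelity Modulator is not given here, but the description — mapping unbounded importance ratios into a bounded, everywhere-differentiable domain — suggests a smooth squashing of the log-ratio. The sketch below is one plausible instantiation (a tanh-of-log transform); the function name, the `scale` parameter, and the specific choice of tanh are illustrative assumptions, not the authors' definition.

```python
import math

def log_fidelity_modulate(ratio, scale=1.0):
    """Hypothetical LFM sketch: squash an unbounded importance ratio
    r = pi_new / pi_old into the bounded interval (-1, 1) via tanh(log r).

    Unlike hard clipping, this is differentiable everywhere, so extreme
    outlier ratios contribute bounded values with nonzero gradients
    rather than hitting a flat, zero-gradient clipping region.
    """
    return math.tanh(scale * math.log(ratio))
```

Note how a ratio of 1.0 (no policy shift) maps to 0, while even a ratio of 1000 stays strictly inside (-1, 1), which is the outlier-suppression behavior the bullet points describe.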

Abstract

Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts, simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.
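The Decoupled Hazard Penalty's exact form is likewise not specified in this summary, but its stated ingredients — a cumulative hazard function from survival analysis, applied independently to upward and downward policy shifts — can be illustrated with a standard Weibull cumulative hazard, H(d) = (d/λ)^k. Everything below (the Weibull choice, the parameter names `lam_pos`, `lam_neg`, `k`, and the default values) is an assumed sketch, not the paper's implementation.

```python
def cumulative_hazard(deviation, lam, k):
    """Weibull cumulative hazard H(d) = (d / lam)^k: a common
    survival-analysis form that grows superlinearly in the deviation,
    so small shifts are nearly free and large shifts are penalized hard."""
    return (deviation / lam) ** k

def decoupled_hazard_penalty(ratio, lam_pos=0.3, lam_neg=0.2, k=2.0):
    """Hypothetical DHP sketch: penalize positive shifts (ratio > 1,
    risking mode collapse via over-expansion) and negative shifts
    (ratio < 1, risking policy erosion via contraction) independently.

    Separate scale parameters lam_pos / lam_neg make the penalty
    asymmetric, matching the abstract's 'decoupled' regulation of the
    two directions.
    """
    upward = max(ratio - 1.0, 0.0)    # expansion beyond the old policy
    downward = max(1.0 - ratio, 0.0)  # contraction below the old policy
    return (cumulative_hazard(upward, lam_pos, k)
            + cumulative_hazard(downward, lam_neg, k))
```

With the illustrative defaults above (lam_neg < lam_pos), a contraction to ratio 0.5 is penalized more heavily than an expansion to ratio 1.5, reflecting the idea that the two failure modes can be regulated on different scales.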