MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

arXiv cs.LG / 3/19/2026


Key Points

  • MHPO proposes a Modulated Hazard-aware Policy Optimization framework to boost stability in GRPO-based reinforcement learning by addressing non-differentiable ratio clipping and gradient fidelity issues.
  • It introduces a Log-Fidelity Modulator (LFM) that maps unbounded importance ratios into a bounded, differentiable domain to limit the impact of high-variance outliers on the loss landscape.
  • It further adds a Decoupled Hazard Penalty (DHP) that uses cumulative hazard functions to independently regulate positive and negative policy shifts, reducing mode collapse and policy erosion within a stabilized trust region.
  • The approach is evaluated on diverse reasoning benchmarks across text-based and vision-language tasks, where MHPO outperforms existing methods and improves training stability.
  • Overall, MHPO provides finer-grained regulation of policy updates, enabling more robust and reliable reinforcement learning training.
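The paper's exact formulation of the Log-Fidelity Modulator is not given here, but the description — mapping unbounded importance ratios into a bounded, everywhere-differentiable domain — suggests a smooth squashing of the log-ratio. The sketch below is one plausible instantiation (a tanh-of-log transform); the function name, the `scale` parameter, and the specific choice of tanh are illustrative assumptions, not the authors' definition.

```python
import math

def log_fidelity_modulate(ratio, scale=1.0):
    """Hypothetical LFM sketch: squash an unbounded importance ratio
    r = pi_new / pi_old into the bounded interval (-1, 1) via tanh(log r).

    Unlike hard clipping, this is differentiable everywhere, so extreme
    outlier ratios contribute bounded values with nonzero gradients
    rather than hitting a flat, zero-gradient clipping region.
    """
    return math.tanh(scale * math.log(ratio))
```

Note how a ratio of 1.0 (no policy shift) maps to 0, while even a ratio of 1000 stays strictly inside (-1, 1), which is the outlier-suppression behavior the bullet points describe.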

Abstract

Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts, simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.
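The Decoupled Hazard Penalty's exact form is likewise not specified in this summary, but its stated ingredients — a cumulative hazard function from survival analysis, applied independently to upward and downward policy shifts — can be illustrated with a standard Weibull cumulative hazard, H(d) = (d/λ)^k. Everything below (the Weibull choice, the parameter names `lam_pos`, `lam_neg`, `k`, and the default values) is an assumed sketch, not the paper's implementation.

```python
def cumulative_hazard(deviation, lam, k):
    """Weibull cumulative hazard H(d) = (d / lam)^k: a common
    survival-analysis form that grows superlinearly in the deviation,
    so small shifts are nearly free and large shifts are penalized hard."""
    return (deviation / lam) ** k

def decoupled_hazard_penalty(ratio, lam_pos=0.3, lam_neg=0.2, k=2.0):
    """Hypothetical DHP sketch: penalize positive shifts (ratio > 1,
    risking mode collapse via over-expansion) and negative shifts
    (ratio < 1, risking policy erosion via contraction) independently.

    Separate scale parameters lam_pos / lam_neg make the penalty
    asymmetric, matching the abstract's 'decoupled' regulation of the
    two directions.
    """
    upward = max(ratio - 1.0, 0.0)    # expansion beyond the old policy
    downward = max(1.0 - ratio, 0.0)  # contraction below the old policy
    return (cumulative_hazard(upward, lam_pos, k)
            + cumulative_hazard(downward, lam_neg, k))
```

With the illustrative defaults above (lam_neg < lam_pos), a contraction to ratio 0.5 is penalized more heavily than an expansion to ratio 1.5, reflecting the idea that the two failure modes can be regulated on different scales.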