Beyond Importance Sampling: Rejection-Gated Policy Optimization

arXiv cs.LG · April 17, 2026


Key Points

  • The paper introduces Rejection-Gated Policy Optimization (RGPO), which learns a differentiable acceptance gate to decide which samples are trustworthy for policy updates instead of reweighting all samples by importance ratios.
  • RGPO replaces the importance-sampling ratio with a smooth acceptance function alpha_theta(s, a) = g(r_theta(s, a)) and incorporates the gate directly into gradient computation, so it is updated implicitly alongside the policy.
  • The authors show RGPO yields finite, bounded gradient variance even when importance sampling ratios are heavy-tailed, where standard importance-sampling variance can diverge.
  • RGPO recovers key policy-gradient methods (TRPO, PPO, and REINFORCE) as special cases via an effective gradient weight w(r) = g'(r) * r, and it provides an approximate monotonic improvement guarantee with bounded, controllable bias.
  • Experiments on online preference fine-tuning of Qwen2.5-1.5B-Instruct using Anthropic HH-RLHF report a Pareto-dominant outcome among online RL methods: higher reward than PPO-RLHF (+14.8%) and lower KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
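The effective-weight correspondence can be checked numerically. The minimal sketch below (ours, not from the paper) reads plain importance sampling as the gate g(r) = r, so w(r) = g'(r) * r = r; reads PPO's clipped surrogate (positive-advantage branch) as g(r) = min(r, 1 + c), so w(r) = r inside the clip range and 0 beyond it; and adds a hypothetical smooth gate g(r) = tanh(r), an illustrative choice only, whose weight w(r) = r / cosh(r)^2 stays bounded for all r.

```python
import numpy as np

def eff_weight(g, r, h=1e-6):
    """Effective gradient weight w(r) = g'(r) * r, via central differences."""
    return (g(r + h) - g(r - h)) / (2 * h) * r

# Plain importance sampling: g(r) = r  ->  w(r) = r (unbounded in the tail).
g_is = lambda r: r

# PPO clipped surrogate, positive-advantage branch: g(r) = min(r, 1 + c).
# Inside the clip, w(r) = r; beyond it, g'(r) = 0 and the sample is rejected.
c = 0.2
g_ppo = lambda r: min(r, 1.0 + c)

# Hypothetical smooth gate (our illustration, not the paper's choice):
# g(r) = tanh(r) saturates, so w(r) = r / cosh(r)^2 peaks and then decays.
g_smooth = lambda r: np.tanh(r)

for r in (0.5, 1.0, 1.5):
    print(r, eff_weight(g_is, r), eff_weight(g_ppo, r), eff_weight(g_smooth, r))
```

Note how the three columns diverge only once r leaves the trust region around 1: at r = 1.5 the IS weight is still 1.5, the PPO weight has dropped to 0, and the smooth gate's weight has decayed but not vanished.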

Abstract

We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PPO in computational cost, requires no second-order optimization, and extends naturally to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF (n = 3 seeds), RGPO uses a dual-ratio gate that anchors learning to both the previous policy and the reference model, achieving a Pareto-dominant outcome: the highest reward among online RL methods (+14.8% vs. PPO-RLHF) and the lowest KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
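The bounded-variance claim can be illustrated with a quick Monte Carlo sketch (our own toy experiment, not the paper's; it also uses a single ratio rather than the dual-ratio gate from the experiments). Ratios are drawn from a Pareto distribution with tail index 1.5, so E[r^2] is infinite and the plain IS weight w(r) = r has diverging variance, while a hypothetical smooth gate g(r) = tanh(r) yields the bounded weight w(r) = r / cosh(r)^2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed importance ratios: classical Pareto with tail index 1.5,
# so E[r^2] is infinite and the plain IS weight w(r) = r has no finite variance.
r = 1.0 + rng.pareto(1.5, size=200_000)

# Plain IS weight: unbounded, dominated by a few extreme ratios.
w_is = r

# Hypothetical smooth gate g(r) = tanh(r): w(r) = g'(r) * r = r / cosh(r)^2,
# which is bounded by max_x x / cosh(x)^2 (about 0.449) for all r >= 0.
with np.errstate(over="ignore"):  # cosh overflows to inf for huge r; r/inf = 0
    w_rgpo = r / np.cosh(r) ** 2

print("max IS weight:   ", w_is.max())
print("max gated weight:", w_rgpo.max())
print("variance ratio:  ", w_is.var() / w_rgpo.var())
```

Rerunning with larger sample sizes makes the contrast sharper: the empirical maximum and variance of the IS weights keep growing with the sample, while the gated weights remain uniformly bounded.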