Beyond Importance Sampling: Rejection-Gated Policy Optimization

arXiv cs.LG · April 17, 2026


Key Points

  • The paper introduces Rejection-Gated Policy Optimization (RGPO), which learns a differentiable acceptance gate to decide which samples are trustworthy for policy updates instead of reweighting all samples by importance ratios.
  • RGPO replaces the importance-sampling ratio with a smooth acceptance function alpha_theta(s, a) = g(r_theta(s, a)) and incorporates the gate directly into gradient computation, so it is updated implicitly alongside the policy.
  • The authors show RGPO yields finite, bounded gradient variance even when importance sampling ratios are heavy-tailed, where standard importance-sampling variance can diverge.
  • RGPO recovers key policy-gradient methods (TRPO, PPO, and REINFORCE) as special cases via an effective gradient weight w(r) = g'(r) * r, and it provides an approximate monotonic improvement guarantee with bounded, controllable bias.
  • Experiments on online preference fine-tuning of Qwen2.5-1.5B-Instruct using Anthropic HH-RLHF report a Pareto-dominant outcome among online RL methods: higher reward than PPO-RLHF (+14.8%) and lower KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
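The effective-weight correspondence can be checked numerically. The minimal sketch below (ours, not from the paper) reads plain importance sampling as the gate g(r) = r, so w(r) = g'(r) * r = r; reads PPO's clipped surrogate (positive-advantage branch) as g(r) = min(r, 1 + c), so w(r) = r inside the clip range and 0 beyond it; and adds a hypothetical smooth gate g(r) = tanh(r), an illustrative choice only, whose weight w(r) = r / cosh(r)^2 stays bounded for all r.

```python
import numpy as np

def eff_weight(g, r, h=1e-6):
    """Effective gradient weight w(r) = g'(r) * r, via central differences."""
    return (g(r + h) - g(r - h)) / (2 * h) * r

# Plain importance sampling: g(r) = r  ->  w(r) = r (unbounded in the tail).
g_is = lambda r: r

# PPO clipped surrogate, positive-advantage branch: g(r) = min(r, 1 + c).
# Inside the clip, w(r) = r; beyond it, g'(r) = 0 and the sample is rejected.
c = 0.2
g_ppo = lambda r: min(r, 1.0 + c)

# Hypothetical smooth gate (our illustration, not the paper's choice):
# g(r) = tanh(r) saturates, so w(r) = r / cosh(r)^2 peaks and then decays.
g_smooth = lambda r: np.tanh(r)

for r in (0.5, 1.0, 1.5):
    print(r, eff_weight(g_is, r), eff_weight(g_ppo, r), eff_weight(g_smooth, r))
```

Note how the three columns diverge only once r leaves the trust region around 1: at r = 1.5 the IS weight is still 1.5, the PPO weight has dropped to 0, and the smooth gate's weight has decayed but not vanished.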

Abstract

We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PPO in computational cost, requires no second-order optimization, and extends naturally to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF (n = 3 seeds), RGPO uses a dual-ratio gate that anchors learning to both the previous policy and the reference model, achieving a Pareto-dominant outcome: the highest reward among online RL methods (+14.8% vs. PPO-RLHF) and the lowest KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
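The bounded-variance claim can be illustrated with a quick Monte Carlo sketch (our own toy experiment, not the paper's; it also uses a single ratio rather than the dual-ratio gate from the experiments). Ratios are drawn from a Pareto distribution with tail index 1.5, so E[r^2] is infinite and the plain IS weight w(r) = r has diverging variance, while a hypothetical smooth gate g(r) = tanh(r) yields the bounded weight w(r) = r / cosh(r)^2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed importance ratios: classical Pareto with tail index 1.5,
# so E[r^2] is infinite and the plain IS weight w(r) = r has no finite variance.
r = 1.0 + rng.pareto(1.5, size=200_000)

# Plain IS weight: unbounded, dominated by a few extreme ratios.
w_is = r

# Hypothetical smooth gate g(r) = tanh(r): w(r) = g'(r) * r = r / cosh(r)^2,
# which is bounded by max_x x / cosh(x)^2 (about 0.449) for all r >= 0.
with np.errstate(over="ignore"):  # cosh overflows to inf for huge r; r/inf = 0
    w_rgpo = r / np.cosh(r) ** 2

print("max IS weight:   ", w_is.max())
print("max gated weight:", w_rgpo.max())
print("variance ratio:  ", w_is.var() / w_rgpo.var())
```

Rerunning with larger sample sizes makes the contrast sharper: the empirical maximum and variance of the IS weights keep growing with the sample, while the gated weights remain uniformly bounded.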