Beyond Importance Sampling: Rejection-Gated Policy Optimization
arXiv cs.LG / 4/17/2026
Key Points
- The paper introduces Rejection-Gated Policy Optimization (RGPO), which learns a differentiable acceptance gate to decide which samples are trustworthy for policy updates instead of reweighting all samples by importance ratios.
- RGPO replaces the importance-sampling ratio with a smooth acceptance function alpha_theta(s, a)=g(r_theta(s, a)) and incorporates this gate directly into gradient computation so it is updated implicitly alongside the policy.
- The authors show that RGPO's gradient variance remains bounded even when importance ratios are heavy-tailed, a regime in which standard importance-sampling variance can diverge.
- RGPO recovers key policy-gradient methods (TRPO, PPO, and REINFORCE) as special cases via an effective gradient weight w(r)=g'(r)*r, and it provides an approximate monotonic improvement guarantee with bounded, controllable bias.
- Experiments on online preference fine-tuning of Qwen2.5-1.5B-Instruct using Anthropic HH-RLHF report Pareto-dominant results versus PPO-RLHF, including higher reward (+14.8%) and lower KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
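The boundedness claim can be illustrated with a minimal numerical sketch. The sigmoid gate g(r) = sigmoid((r - 1) / tau) and the temperature tau below are our own illustrative choices, not taken from the paper; the point is that for any smooth saturating gate, the effective weight w(r) = g'(r) * r decays for extreme ratios instead of growing without bound like the raw importance weight r.

```python
import numpy as np

# Hedged sketch of RGPO's effective gradient weight, assuming a
# sigmoid acceptance gate g(r) = sigmoid((r - 1) / tau). The paper's
# exact gate is not specified here; tau is a hypothetical temperature.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(r, tau=0.5):
    """Smooth acceptance probability g(r) in (0, 1)."""
    return sigmoid((r - 1.0) / tau)

def gate_grad(r, tau=0.5):
    """g'(r) for the sigmoid gate: s * (1 - s) / tau."""
    s = gate(r, tau)
    return s * (1.0 - s) / tau

def effective_weight(r, tau=0.5):
    """RGPO's per-sample gradient weight w(r) = g'(r) * r."""
    return gate_grad(r, tau) * r

# Heavy-tailed importance ratios: the raw IS weight r is unbounded,
# but w(r) decays toward 0 for extreme r, keeping variance finite.
r = np.array([0.1, 0.5, 1.0, 2.0, 10.0, 100.0])
print(effective_weight(r))  # largest weight near r = 1, vanishing tails
```

With this gate the weight peaks near r = 1 (on-policy samples) and is exponentially suppressed for ratios far from 1, which is the mechanism behind the bounded-variance result summarized above.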
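The special-case claim can also be made concrete. Assuming the effective-weight identity w(r) = g'(r) * r from the bullet above, choosing g(r) = log r gives g'(r) = 1/r and hence w(r) = 1, the plain REINFORCE weight. The PPO comparison below is our own rough caricature of the clipped gradient (weight r inside the trust region, zero where the clip is active), not a gate derived in the paper.

```python
import numpy as np

# Hedged illustration of how w(r) = g'(r) * r recovers known methods
# as special cases, per the paper's claim. The gate choices below are
# reverse-engineered guesses for illustration, not taken from the paper.

def w_reinforce(r):
    # g(r) = log r  =>  g'(r) = 1/r  =>  w(r) = 1 for every sample:
    # the plain REINFORCE weight, independent of the importance ratio.
    return (1.0 / r) * r

def w_ppo_clip(r, eps=0.2):
    # Caricature of PPO's clipped gradient: weight r inside the trust
    # region [1 - eps, 1 + eps], zero where the clip zeroes the gradient.
    inside = (np.abs(r - 1.0) <= eps).astype(float)
    return inside * r

r = np.array([0.7, 0.9, 1.0, 1.1, 1.5])
print(w_reinforce(r))  # all ones
print(w_ppo_clip(r))   # zero outside [0.8, 1.2], r inside
```

Seen this way, TRPO/PPO/REINFORCE differ only in how sharply the acceptance gate cuts off unreliable ratios, and RGPO's learned smooth gate interpolates among them.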