Target Policy Optimization

arXiv cs.LG / 4/8/2026


Key Points

  • The paper introduces Target Policy Optimization (TPO), an RL method that decouples selecting which sampled completions to reinforce from deciding how to update the policy parameters.
  • TPO builds a target distribution over scored completions as q_i ∝ p_i^old * exp(u_i) and trains the policy via cross-entropy, yielding a logit gradient of p^θ − q that becomes zero when the policy matches the target.
  • Experiments across tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR show TPO matches standard policy-gradient family methods on easier tasks.
  • Under sparse-reward settings, TPO substantially outperforms PG, PPO, GRPO, and DG, suggesting improved robustness to the overshoot/undershoot issues tied to learning rate and other optimizer choices.
  • The authors provide an open-source implementation at the linked GitHub repository, facilitating reproduction and adoption.
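The target construction in the second point can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `p_old` stands for the old policy's probabilities over the sampled completions and `u` for their scores (the paper's u_i), names chosen here for clarity.

```python
import numpy as np

def tpo_target(p_old, u):
    """Build the TPO target q_i ∝ p_i^old * exp(u_i), normalized over the group."""
    w = p_old * np.exp(u)
    return w / w.sum()

# Toy group of three sampled completions: old-policy probabilities and scores.
p_old = np.array([0.5, 0.3, 0.2])
u = np.array([1.0, 0.0, -1.0])
q = tpo_target(p_old, u)
```

Because the weights are exponentiated scores times old probabilities, higher-scored completions gain probability mass relative to the old policy while lower-scored ones lose it.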

Abstract

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce *Target Policy Optimization* (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution q_i ∝ p_i^old exp(u_i) and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is p^θ − q, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
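The claim that the cross-entropy loss has logit gradient p^θ − q is a standard softmax/cross-entropy identity, and it can be checked numerically. The sketch below (assumed variable names, not the paper's code) compares the analytic gradient against finite differences of the loss −Σ_i q_i log p^θ_i:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def ce_loss(z, q):
    """Cross-entropy of policy softmax(z) against target distribution q."""
    return -(q * np.log(softmax(z))).sum()

def ce_grad_logits(z, q):
    """Analytic gradient of the loss w.r.t. logits: p^θ − q."""
    return softmax(z) - q

z = np.array([0.2, -0.1, 0.4])   # current policy logits over sampled completions
q = np.array([0.6, 0.3, 0.1])    # TPO target distribution

# Finite-difference gradient for comparison.
eps = 1e-6
num_grad = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    num_grad[i] = (ce_loss(zp, q) - ce_loss(zm, q)) / (2 * eps)

ana_grad = ce_grad_logits(z, q)
```

Setting z = log q makes the policy match the target exactly, at which point the gradient is the zero vector, matching the fixed-point property described in the abstract.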