Target Policy Optimization

arXiv cs.LG / 4/8/2026


Key Points

  • The paper introduces Target Policy Optimization (TPO), an RL method that decouples selecting which sampled completions to reinforce from deciding how to update the policy parameters.
  • TPO builds a target distribution over scored completions as q_i ∝ p_i^old * exp(u_i) and trains the policy via cross-entropy, yielding a logit gradient of p^θ − q that becomes zero when the policy matches the target.
  • Experiments across tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR show TPO matches standard policy-gradient family methods on easier tasks.
  • Under sparse-reward settings, TPO substantially outperforms PG, PPO, GRPO, and DG, suggesting improved robustness to the overshoot/undershoot issues tied to learning rate and other optimizer choices.
  • The authors provide an open-source implementation at the linked GitHub repository, facilitating reproduction and adoption.
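The target construction in the second point can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `p_old` stands for the old policy's probabilities over the sampled completions and `u` for their scores (the paper's u_i), names chosen here for clarity.

```python
import numpy as np

def tpo_target(p_old, u):
    """Build the TPO target q_i ∝ p_i^old * exp(u_i), normalized over the group."""
    w = p_old * np.exp(u)
    return w / w.sum()

# Toy group of three sampled completions: old-policy probabilities and scores.
p_old = np.array([0.5, 0.3, 0.2])
u = np.array([1.0, 0.0, -1.0])
q = tpo_target(p_old, u)
```

Because the weights are exponentiated scores times old probabilities, higher-scored completions gain probability mass relative to the old policy while lower-scored ones lose it.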

Abstract

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce *Target Policy Optimization* (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution q_i ∝ p_i^old exp(u_i) and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is p^θ − q, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
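The claim that the cross-entropy loss has logit gradient p^θ − q is a standard softmax/cross-entropy identity, and it can be checked numerically. The sketch below (assumed variable names, not the paper's code) compares the analytic gradient against finite differences of the loss −Σ_i q_i log p^θ_i:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def ce_loss(z, q):
    """Cross-entropy of policy softmax(z) against target distribution q."""
    return -(q * np.log(softmax(z))).sum()

def ce_grad_logits(z, q):
    """Analytic gradient of the loss w.r.t. logits: p^θ − q."""
    return softmax(z) - q

z = np.array([0.2, -0.1, 0.4])   # current policy logits over sampled completions
q = np.array([0.6, 0.3, 0.1])    # TPO target distribution

# Finite-difference gradient for comparison.
eps = 1e-6
num_grad = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    num_grad[i] = (ce_loss(zp, q) - ce_loss(zm, q)) / (2 * eps)

ana_grad = ce_grad_logits(z, q)
```

Setting z = log q makes the policy match the target exactly, at which point the gradient is the zero vector, matching the fixed-point property described in the abstract.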