Reinforcement Learning for LLM Post-Training: A Survey

arXiv cs.CL / 5/4/2026


Key Points

  • The paper surveys reinforcement learning (RL) based post-training methods for large language models, focusing on how they address harmful, misaligned outputs and improve performance in areas like math and coding.
  • It highlights that while RLHF methods (e.g., DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches (e.g., PPO, GRPO) have shown strong gains, prior work lacked a technically detailed, side-by-side comparison of these approaches.
  • The authors propose a unified policy-gradient framework that treats pretraining, SFT, RLHF, and RLVR as special cases, connecting foundational techniques with newer advances (see the sketch after this list).
  • The survey breaks each method down along three key algorithmic axes (prompt sampling, response sampling, and the gradient coefficient) and standardizes notation to enable direct cross-method comparisons.
  • It also compares implementation details and empirical results for each method, aiming to serve as a technical reference for researchers and practitioners.
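
A minimal sketch of what such a unified policy-gradient objective can look like is given below. The symbols ($p(x)$, $\pi_{\mathrm{sample}}$, $c_t$) and the special-case readings in the comments are illustrative assumptions for this summary, not the survey's exact notation.

```latex
% Hedged sketch (illustrative notation, not the survey's): a generic
% policy-gradient update whose data distribution p(x), sampling policy
% \pi_sample, and per-token weight c_t are left as free design choices.
\[
  \nabla_\theta J(\theta)
  \;=\;
  \mathbb{E}_{\,x \sim p(x),\; y \sim \pi_{\mathrm{sample}}(\cdot \mid x)}
  \left[
    \sum_{t=1}^{|y|} c_t(x, y)\,
    \nabla_\theta \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)
  \right]
\]
% Illustrative special cases (assumptions for this summary):
%   SFT / MLE : y comes from a fixed reference dataset and c_t = 1.
%   PPO-style : y is sampled from a recent policy; c_t is a clipped
%               importance ratio times an advantage estimate.
%   GRPO-style: c_t is a group-normalized advantage computed from several
%               responses sampled for the same prompt x.
```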

Abstract

Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful or misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods such as Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches such as PPO and GRPO, have made remarkable gains in alleviating these issues. Yet no existing work offers a technically detailed comparison of the various methods driving this progress. To fill this gap, we present a timely survey that connects foundational components with the latest advancements. We derive a single policy-gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases, and we organize the more recent techniques within it. The main contributions of our survey are as follows: (1) a self-contained introduction to MLE, RLHF, and RLVR foundations and the unified policy-gradient framework; (2) a detailed technical analysis of PPO- and GRPO-based methods alongside offline and iterative DPO approaches, decomposed along prompt-sampling, response-sampling, and gradient-coefficient axes; (3) standardized notation enabling direct cross-method comparison; and (4) a comprehensive comparison of the implementation details and empirical results of each method in the appendix. We aim for this survey to serve as a technically grounded reference for researchers and practitioners working on LLM post-training.
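
As a concrete example of the gradient-coefficient axis mentioned in the abstract, GRPO-style methods weight each response by a group-normalized advantage computed from several samples for the same prompt. The sketch below is a hedged, simplified on-policy form (the reward symbol $r_i$, the group size $G$, and the omission of the clipped importance ratio and KL penalty are assumptions for illustration, not the survey's definitions).

```latex
% Hedged illustration of one gradient-coefficient choice: a GRPO-style
% group-normalized advantage. r_i is a verifiable reward (e.g., 1 if the
% final answer is correct, 0 otherwise) for the i-th of G sampled responses
% to the same prompt x. Simplified on-policy form; the full objective
% typically adds a clipped importance ratio and a KL penalty.
\[
  \hat{A}_i \;=\;
  \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
       {\operatorname{std}(r_1, \dots, r_G)},
  \qquad
  c_t(x, y_i) \;=\; \hat{A}_i
  \;\;\text{for every token } t \text{ of response } y_i .
\]
```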