Reinforcement Learning for LLM Post-Training: A Survey

arXiv cs.CL / 5/4/2026


Key Points

  • The paper surveys reinforcement learning (RL) based post-training methods for large language models, focusing on how they address harmful, misaligned outputs and improve performance in areas like math and coding.
  • It highlights that while RLHF methods (e.g., DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches (e.g., PPO, GRPO) have shown strong gains, prior work lacked a technically detailed, side-by-side comparison of these approaches.
  • The authors propose a unified policy-gradient framework that treats pretraining, SFT, RLHF, and RLVR as special cases, connecting foundational techniques with newer advances (see the sketch after this list).
  • The survey breaks each method down along three key algorithmic axes (prompt sampling, response sampling, and the gradient coefficient) and standardizes notation to enable direct cross-method comparisons.
  • It also compares implementation details and empirical results for each method, aiming to serve as a technical reference for researchers and practitioners.
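
A minimal sketch of what such a unified policy-gradient objective can look like is given below. The symbols ($p(x)$, $\pi_{\mathrm{sample}}$, $c_t$) and the special-case readings in the comments are illustrative assumptions for this summary, not the survey's exact notation.

```latex
% Hedged sketch (illustrative notation, not the survey's): a generic
% policy-gradient update whose data distribution p(x), sampling policy
% \pi_sample, and per-token weight c_t are left as free design choices.
\[
  \nabla_\theta J(\theta)
  \;=\;
  \mathbb{E}_{\,x \sim p(x),\; y \sim \pi_{\mathrm{sample}}(\cdot \mid x)}
  \left[
    \sum_{t=1}^{|y|} c_t(x, y)\,
    \nabla_\theta \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)
  \right]
\]
% Illustrative special cases (assumptions for this summary):
%   SFT / MLE : y comes from a fixed reference dataset and c_t = 1.
%   PPO-style : y is sampled from a recent policy; c_t is a clipped
%               importance ratio times an advantage estimate.
%   GRPO-style: c_t is a group-normalized advantage computed from several
%               responses sampled for the same prompt x.
```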

Abstract

Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful or misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods such as Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches such as PPO and GRPO, have made remarkable gains in alleviating these issues. Yet no existing work offers a technically detailed comparison of the various methods driving this progress. To fill this gap, we present a timely survey that connects foundational components with the latest advancements. We derive a single policy-gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases, and we organize the more recent techniques within it. The main contributions of our survey are as follows: (1) a self-contained introduction to MLE, RLHF, and RLVR foundations and the unified policy-gradient framework; (2) a detailed technical analysis of PPO- and GRPO-based methods alongside offline and iterative DPO approaches, decomposed along prompt-sampling, response-sampling, and gradient-coefficient axes; (3) standardized notation enabling direct cross-method comparison; and (4) a comprehensive comparison of the implementation details and empirical results of each method in the appendix. We aim for this survey to serve as a technically grounded reference for researchers and practitioners working on LLM post-training.
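
As a concrete example of the gradient-coefficient axis mentioned in the abstract, GRPO-style methods weight each response by a group-normalized advantage computed from several samples for the same prompt. The sketch below is a hedged, simplified on-policy form (the reward symbol $r_i$, the group size $G$, and the omission of the clipped importance ratio and KL penalty are assumptions for illustration, not the survey's definitions).

```latex
% Hedged illustration of one gradient-coefficient choice: a GRPO-style
% group-normalized advantage. r_i is a verifiable reward (e.g., 1 if the
% final answer is correct, 0 otherwise) for the i-th of G sampled responses
% to the same prompt x. Simplified on-policy form; the full objective
% typically adds a clipped importance ratio and a KL penalty.
\[
  \hat{A}_i \;=\;
  \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
       {\operatorname{std}(r_1, \dots, r_G)},
  \qquad
  c_t(x, y_i) \;=\; \hat{A}_i
  \;\;\text{for every token } t \text{ of response } y_i .
\]
```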