Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
arXiv cs.CL / 4/10/2026
Key Points
- The paper surveys LLM post-training methods and proposes a unified framework based on how they intervene on model behavior rather than on how their training objectives are labeled.
- It organizes learning into two regimes, off-policy learning from externally supplied trajectories and on-policy learning from learner-generated rollouts, and further explains methods through roles such as effective support expansion and policy reshaping (a toy sketch contrasting the two regimes follows this list).
- The authors add a systems-level concept, behavioral consolidation, to describe how techniques preserve, transfer, and amortize behaviors across training stages and model transitions.
- The framework maps major paradigms (e.g., SFT, preference optimization, on-policy RL, distillation) to these roles, arguing that SFT and preference-based methods often address different behavioral bottlenecks.
- The paper concludes that improving post-training increasingly depends on coordinated system/stage design instead of any single dominant training objective.
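
Under this framing, the practical difference between the two regimes is where the trajectories being scored come from. The minimal sketch below is an illustrative assumption rather than code from the paper: the toy per-step policy, vocabulary size, and reward function are all made up for exposition. It contrasts an off-policy, SFT-style likelihood step on an externally supplied demonstration with an on-policy, REINFORCE-style step on a rollout sampled from the current policy.

```python
# Toy sketch (assumed setup, not the paper's code): off-policy vs on-policy updates.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, SEQ_LEN = 8, 4
# Toy "policy": unconditional per-step logits over a small vocabulary.
logits = torch.zeros(SEQ_LEN, VOCAB, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.1)

def off_policy_step(expert_tokens):
    """Off-policy (SFT-style): raise the log-probability of externally supplied tokens."""
    logp = F.log_softmax(logits, dim=-1)
    loss = -logp.gather(-1, expert_tokens.unsqueeze(-1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def on_policy_step(reward_fn):
    """On-policy (REINFORCE-style): sample a rollout from the current policy
    and reinforce it in proportion to its reward."""
    dist = torch.distributions.Categorical(logits=logits)
    rollout = dist.sample()                      # learner-generated trajectory
    reward = reward_fn(rollout)                  # scalar feedback
    loss = -(dist.log_prob(rollout).sum() * reward)
    opt.zero_grad(); loss.backward(); opt.step()
    return reward

# Usage: one step in each regime.
expert = torch.randint(VOCAB, (SEQ_LEN,))        # external demonstration
off_policy_step(expert)
on_policy_step(lambda toks: float((toks == expert).float().mean()))
```

The only structural difference between the two steps is the source of the tokens being scored: external data in the off-policy case, the learner's own reward-weighted samples in the on-policy case, which is the distinction the survey's two regimes hinge on.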



