Efficient RL Training for LLMs with Experience Replay

arXiv cs.LG / 4/13/2026


Key Points

  • The paper investigates whether experience replay—reusing stored rollouts during training—can work effectively for LLM post-training despite the common belief that strictly on-policy, fresh data is required.
  • It formalizes the replay-buffer design problem for LLM post-training as a trade-off among replay staleness (variance), sample diversity, and the compute cost of generating new data.
  • The authors find that strict on-policy sampling can be suboptimal when generating new samples is expensive, implying replay can be a more compute-efficient training strategy.
  • Experiments indicate that an appropriately designed replay buffer can substantially reduce inference compute while maintaining, and in some cases improving, final model performance, all while preserving policy entropy.
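The trade-off in the second point above can be made concrete with a minimal sketch of a staleness- and reuse-capped replay buffer. This is an illustration, not the paper's actual algorithm: the class and parameter names (`ReplayBuffer`, `max_staleness`, `max_reuse`) are hypothetical, and the eviction rules are one plausible instantiation of the staleness/diversity/compute trade-off the paper formalizes.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """A stored rollout, tagged with the policy version that generated it."""
    prompt: str
    response: str
    reward: float
    policy_version: int
    reuse_count: int = field(default=0)


class ReplayBuffer:
    """Hypothetical replay buffer capping rollout staleness and reuse.

    - max_staleness bounds how many policy updates old a rollout may be
      (controlling staleness-induced variance).
    - max_reuse bounds how many times a rollout is trained on
      (trading diversity against the cost of generating fresh data).
    """

    def __init__(self, capacity: int = 1024,
                 max_staleness: int = 4, max_reuse: int = 2):
        self.buffer: deque[Rollout] = deque(maxlen=capacity)
        self.max_staleness = max_staleness
        self.max_reuse = max_reuse

    def add(self, rollout: Rollout) -> None:
        self.buffer.append(rollout)

    def sample(self, batch_size: int, current_version: int) -> list[Rollout]:
        # Only rollouts that are fresh enough and not over-reused are eligible.
        eligible = [
            r for r in self.buffer
            if current_version - r.policy_version <= self.max_staleness
            and r.reuse_count < self.max_reuse
        ]
        batch = random.sample(eligible, min(batch_size, len(eligible)))
        for r in batch:
            r.reuse_count += 1
        return batch
```

In a training loop, one would generate a fresh minibatch of rollouts only when `sample` returns fewer items than the target batch size, so inference compute is spent only when the buffer cannot supply sufficiently fresh, under-reused data.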

Abstract

While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.