Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts

arXiv cs.LG / 3/24/2026


Key Points

  • The paper introduces “Prompt Replay,” an overhead-free online data selection method for GRPO that reuses only prompts (not full trajectories) to maintain on-policy optimization while reducing wasted compute from unusable prompts.
  • It stores medium-difficulty prompts in a buffer and prioritizes prompts whose expected pass rate is near 0.5 to maximize learning signal via advantage, then mixes reused prompts with fresh samples to form training batches.
  • The approach uses cooldown steps and maximum reuse limits to balance aggressiveness against the risk of overfitting, and the authors report faster initial accuracy gains on six math benchmarks.
  • Across multiple model families and datasets, Prompt Replay reduces zero-variance prompts and increases mean absolute advantage, but it ultimately plateaus and converges with the baseline, which the authors attribute to an overly aggressive reuse configuration.
  • The authors note that Qwen2.5-Math may show spurious-reward effects that can invalidate ablation conclusions, suggesting caution when using it as the sole GRPO research testbed.

Abstract

Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capabilities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses only prompts (not trajectories) to preserve on-policy optimization. After each step, we insert medium-difficulty prompts into a buffer and prioritize prompts with a pass rate closer to 0.5 (half the answers correct, half wrong) to maximize the advantage, and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and maximum reuse counts controlling aggressiveness versus the risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated by average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases mean absolute advantage, and shows faster initial accuracy gains. Yet it plateaus and converges with the baseline, as an overly aggressive configuration was used. The method is most efficient when rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidate ablations, cautioning against using it as the sole testbed for GRPO method research.
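The two diagnostics the abstract reports (zero-variance prompts and mean absolute advantage) both fall out of the GRPO-style group-normalized advantage. A minimal sketch, assuming binary verifiable rewards and the standard per-group normalization (the function name is illustrative):

```python
# GRPO-style group advantage: normalize each rollout's reward by the
# group's mean and standard deviation. If every rollout in the group gets
# the same reward (pass rate 0.0 or 1.0), the variance is zero and the
# advantages are all zero, so the prompt contributes no gradient -- this
# is the "zero-variance prompt" waste that Prompt Replay targets.
def group_advantages(rewards):
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # zero-variance prompt: no learning signal
    return [(r - mean) / std for r in rewards]
```

A pass rate near 0.5 maximizes the group's reward variance, which is why prioritizing such prompts raises the mean absolute advantage: for rewards `[1, 0, 1, 0]` the advantages are `[1, -1, 1, -1]`, whereas `[1, 1, 1, 1]` yields all zeros.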