Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts
arXiv cs.LG · March 24, 2026
Key Points
- The paper introduces “Prompt Replay,” an overhead-free online data selection method for GRPO that reuses only prompts (not full trajectories) to maintain on-policy optimization while reducing wasted compute from unusable prompts.
- It stores medium-difficulty prompts in a buffer, prioritizing those whose expected pass rate is near 0.5 (where the group-normalized advantage carries the most learning signal), and mixes reused prompts with fresh samples to form training batches.
- The approach uses cooldown steps and maximum reuse limits to balance aggressiveness against the risk of overfitting, and the authors report faster initial accuracy gains on six math benchmarks.
- Across multiple model families and datasets, Prompt Replay reduces the fraction of zero-variance prompts and increases mean absolute advantage, but its gains plateau and it converges with the baseline when reuse aggressiveness is misconfigured.
- The authors note that Qwen2.5-Math may show spurious-reward effects that can invalidate ablation conclusions, suggesting caution when using it as the sole GRPO research testbed.
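The selection mechanism described above can be sketched as a small replay buffer. This is an illustrative reconstruction, not the paper's implementation: the class and parameter names (`cooldown`, `max_reuse`, `replay_frac`) are assumptions, and the eviction and priority rules are a minimal reading of "keep medium-difficulty prompts, prefer pass rates near 0.5."

```python
class PromptReplayBuffer:
    """Sketch of a prompt-replay buffer for GRPO-style training.

    Names and parameters are illustrative assumptions, not the paper's code.
    """

    def __init__(self, cooldown=2, max_reuse=3, capacity=512):
        self.cooldown = cooldown    # steps a prompt must wait before reuse
        self.max_reuse = max_reuse  # cap on how often a prompt is replayed
        self.capacity = capacity
        self.buffer = {}            # prompt -> {"p": pass_rate, "last": step, "uses": n}

    def observe(self, prompt, pass_rate, step):
        # Zero-variance prompts (all rollouts pass or all fail) yield zero
        # GRPO advantage, so only medium-difficulty prompts are stored.
        if 0.0 < pass_rate < 1.0:
            uses = self.buffer.get(prompt, {}).get("uses", 0)
            self.buffer[prompt] = {"p": pass_rate, "last": step, "uses": uses}
        else:
            self.buffer.pop(prompt, None)
        if len(self.buffer) > self.capacity:
            # Evict the prompt whose pass rate is farthest from 0.5.
            worst = max(self.buffer, key=lambda q: abs(self.buffer[q]["p"] - 0.5))
            del self.buffer[worst]

    def sample(self, k, step):
        # Eligible prompts: past their cooldown and under the reuse cap.
        eligible = [q for q, m in self.buffer.items()
                    if step - m["last"] >= self.cooldown and m["uses"] < self.max_reuse]
        # Prioritize pass rates near 0.5, where |advantage| is largest.
        eligible.sort(key=lambda q: abs(self.buffer[q]["p"] - 0.5))
        chosen = eligible[:k]
        for q in chosen:
            self.buffer[q]["uses"] += 1
            self.buffer[q]["last"] = step
        return chosen


def build_batch(fresh_prompts, buf, step, replay_frac=0.5):
    # Mix replayed prompts with fresh samples to form the training batch.
    k = int(len(fresh_prompts) * replay_frac)
    replayed = buf.sample(k, step)
    return replayed + fresh_prompts[: len(fresh_prompts) - len(replayed)]
```

The cooldown and `max_reuse` cap correspond to the paper's knobs for trading reuse aggressiveness against overfitting risk: a longer cooldown keeps replayed prompts closer to on-policy statistics, while the reuse cap bounds how often any single prompt can dominate training.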