Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
arXiv cs.CL / 4/21/2026
Key Points
- The paper studies why classic on-policy RL methods (e.g., PPO, GRPO, REINFORCE++) are sample-inefficient for LLM/VLM post-training, since they discard trajectories after each update, which is costly for multi-turn agentic tasks.
- It argues that directly applying Prioritized Experience Replay (PER) to LLMs/VLMs fails because rapidly changing policies make stored priorities go stale, causing uninformative trajectories to be over-sampled.
- The authors propose “Freshness-Aware PER,” which fixes priority staleness by adding a multiplicative exponential age decay to PER priorities, motivated by effective sample size analysis.
- Experiments on eight multi-step agentic/reasoning/math tasks using 0.5B, 3B, and 7B models show large gains over on-policy baselines (e.g., +46% on NQ Search, +367% on Sokoban, +133% on VLM FrozenLake) and degraded results when using standard PER without age decay.
- The implementation is released publicly via GitHub, enabling practitioners to try the method in LLM/VLM reinforcement learning pipelines.
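The core idea described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's released implementation: the decay rate, buffer layout, and function names are all assumptions, and real PER implementations use sum-trees and importance-sampling corrections that are omitted here.

```python
import math
import random

# Illustrative sketch: multiply each stored PER priority by an exponential
# age decay before sampling, so stale trajectories fade out of the mix.
# DECAY_RATE is an assumed hyperparameter, not a value from the paper.
DECAY_RATE = 0.5

def freshness_adjusted_priority(base_priority, age_in_updates, decay_rate=DECAY_RATE):
    """Stored priority scaled by exp(-decay_rate * age)."""
    return base_priority * math.exp(-decay_rate * age_in_updates)

def sample_trajectory(buffer, current_step):
    """Sample one (priority, stored_step, trajectory) entry proportionally
    to its freshness-adjusted priority (linear scan for clarity)."""
    weights = [
        freshness_adjusted_priority(p, current_step - stored_step)
        for (p, stored_step, _) in buffer
    ]
    total = sum(weights)
    r = random.uniform(0.0, total)
    cumulative = 0.0
    for weight, entry in zip(weights, buffer):
        cumulative += weight
        if cumulative >= r:
            return entry
    return buffer[-1]  # guard against floating-point edge at r == total
```

With zero age the priority is untouched, while a trajectory several updates old contributes proportionally less, which matches the paper's stated goal of keeping the effective sample size dominated by fresh, informative trajectories.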