Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

arXiv cs.CL / 4/21/2026


Key Points

  • The paper studies why classic on-policy RL methods (e.g., PPO, GRPO, REINFORCE++) are sample-inefficient for LLM/VLM post-training, since they discard trajectories after each update, which is costly for multi-turn agentic tasks.
  • It argues that directly applying Prioritized Experience Replay (PER) to LLMs/VLMs fails: the rapidly evolving policy makes stored priorities go stale, so old high-priority trajectories keep being oversampled long after they stop being informative.
  • The authors propose “Freshness-Aware PER,” which fixes priority staleness by adding a multiplicative exponential age decay to PER priorities, motivated by effective sample size analysis.
  • Experiments on eight multi-step agentic/reasoning/math tasks using 0.5B, 3B, and 7B models show large gains over on-policy baselines (e.g., +46% on NQ Search, +367% on Sokoban, +133% on VLM FrozenLake) and degraded results when using standard PER without age decay.
  • The implementation is released publicly via GitHub, enabling practitioners to try the method in LLM/VLM reinforcement learning pipelines.
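The core idea in the key points, a PER priority multiplied by an exponential decay in the trajectory's age, can be sketched in a minimal replay buffer. This is a hypothetical illustration, not the paper's implementation: the decay rate `lam`, the priority exponent `alpha`, and the step-counter bookkeeping are all assumptions.

```python
import math
import random

class FreshnessAwarePERBuffer:
    """Sketch of prioritized replay with multiplicative exponential age decay.

    Illustrative only: `alpha` (PER priority exponent) and `lam`
    (freshness decay rate per policy update) are assumed hyperparameters.
    """

    def __init__(self, capacity=1000, alpha=0.6, lam=0.1):
        self.capacity = capacity
        self.alpha = alpha
        self.lam = lam
        self.buffer = []  # entries: (trajectory, base_priority, step_added)
        self.step = 0     # counts policy updates, used to measure age

    def add(self, trajectory, priority):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)  # drop the oldest entry when full
        self.buffer.append((trajectory, priority, self.step))

    def on_policy_update(self):
        # Each gradient update ages every stored trajectory by one step.
        self.step += 1

    def effective_priority(self, base_priority, step_added):
        age = self.step - step_added
        # Multiplicative exponential age decay on top of the PER priority.
        return (base_priority ** self.alpha) * math.exp(-self.lam * age)

    def sample(self, k):
        # Sample trajectories proportionally to their decayed priorities.
        weights = [self.effective_priority(p, s) for _, p, s in self.buffer]
        total = sum(weights)
        probs = [w / total for w in weights]
        return random.choices(self.buffer, weights=probs, k=k)
```

With `lam > 0`, two trajectories stored with the same base priority are sampled at different rates: the older one's effective priority shrinks by a factor of `exp(-lam)` per policy update, which is what prevents stale high-priority trajectories from dominating sampling.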

Abstract

Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.
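The abstract grounds the age decay in effective sample size (ESS) analysis. As a point of reference, the standard ESS estimate from a set of sampling weights (the textbook definition, not necessarily the paper's exact derivation) is:

```python
def effective_sample_size(weights):
    """Standard ESS estimate: (sum w)^2 / sum(w^2).

    Equals n for uniform weights and approaches 1 as a single
    weight dominates, quantifying how concentrated sampling is.
    """
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2
```

Intuitively, when a few stale high-priority trajectories dominate the sampling distribution, ESS collapses toward 1, so most replayed gradient signal comes from a handful of outdated trajectories; decaying old priorities keeps the weights flatter and the ESS higher.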