Data Deletion Can Help in Adaptive RL

arXiv cs.LG / 5/4/2026

📰 News · Models & Research

Key Points

  • The paper studies adaptive reinforcement learning in time-varying environments using a contextual MDP setup where the context is low-dimensional and unknown at test time.
  • It improves the context-estimation approach with a simple trick: randomly deleting a fraction of the training replay buffer after each round (a minimal sketch follows this list).
  • Random deletion implicitly downweights older, off-distribution trajectories collected under earlier policies, reducing the estimator’s robustness gap by about 30% for MLPs and 6% on average for recurrent networks.
  • The method also enables smaller models (e.g., an MLP with 5× fewer parameters) to outperform larger MLP baselines trained without deletion.
  • The authors provide theoretical analysis via mismatch-aware regularized risk minimization, proving that uniform random deletion can reduce expected test loss, and deriving concrete conditions (e.g., for ridge regression) tied to regularization strength and SNR-based mismatch thresholds.
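
For concreteness, here is a minimal Python sketch of the deletion trick. The buffer layout, the `random_delete` helper, and the deletion fraction of 0.2 are illustrative assumptions, not the paper's exact recipe; the point is that thinning the buffer uniformly after each round means a sample collected k rounds ago survives with probability (1 − p)^k, the implicit exponential decay described above.

```python
import random

def random_delete(buffer, fraction, rng=random):
    """Uniformly delete `fraction` of the buffer in place.

    A sample survives each round with probability (1 - fraction), so a
    sample collected k rounds ago survives with probability
    (1 - fraction) ** k: an implicit exponential decay on stale data,
    with no need to identify which samples are off-distribution.
    """
    keep = 1.0 - fraction
    buffer[:] = [x for x in buffer if rng.random() < keep]

# Toy demonstration: tag each sample with its collection round and
# watch older rounds thin out exponentially.
buffer = []
for round_idx in range(10):
    buffer.extend([round_idx] * 1000)    # "collect" 1000 samples this round
    random_delete(buffer, fraction=0.2)  # 0.2 is an illustrative fraction

counts = {r: buffer.count(r) for r in range(10)}
print(counts)  # round r retains roughly 1000 * 0.8 ** (10 - r) samples
```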

Abstract

Deploying reinforcement learning policies in the real world requires adapting to time-varying environments. We study this problem in the contextual Markov Decision Process (cMDP) framework, where a family of environments is indexed by a low-dimensional context unknown at test time. The standard approach decomposes the problem: train a so-called "universal policy" that assumes knowledge of the true context, then pair it with a context estimator that approximates the context from the observed trajectory. We identify a simple, counterintuitive trick that substantially improves the estimator: randomly delete a fraction of the training buffer after each round. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an implicit exponential decay on older data while preserving diversity, without requiring any explicit identification of which samples are stale. This reduces the robustness gap by 30% for MLPs and by 6% on average for recurrent networks. Strikingly, it allows a narrow MLP with 5× fewer parameters to outperform a wide MLP trained without deletion. To understand when and why deletion helps, we analyze regularized empirical risk minimization with a mismatch between the training distribution and the distribution at deployment; in this idealized setting, we prove that removing a single uniformly random training point decreases expected test loss under mild conditions. For ridge regression we make this quantitative: deletion helps when the regularization coefficient is moderate and the signal-to-noise ratio (SNR) is sufficiently low, and, crucially, this SNR threshold gives a direct measure of how large the distribution mismatch between training and deployment must be for deletion to be beneficial.
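
In symbols, the idealized setting sketched in the abstract can be written as follows. The notation here is mine, not necessarily the paper's: regularized ERM is fit on the training distribution, evaluated under a mismatched deployment distribution, and the claim concerns deleting one uniformly random training point.

```latex
% Regularized ERM on the training distribution (notation is illustrative):
\[
  \hat{w} = \arg\min_{w}\;
    \frac{1}{n}\sum_{i=1}^{n} \ell(w; x_i, y_i) + \lambda \lVert w \rVert_2^2,
  \qquad (x_i, y_i) \sim P_{\text{train}} .
\]
% Deployment risk under a mismatched distribution:
\[
  L_{\text{test}}(w) = \mathbb{E}_{(x,y) \sim P_{\text{test}}}\bigl[\ell(w; x, y)\bigr],
  \qquad P_{\text{test}} \neq P_{\text{train}} .
\]
% The claimed effect: with \hat{w}_{-I} the solution after removing point I,
% deleting one uniformly random point helps in expectation,
\[
  \mathbb{E}_{I \sim \mathrm{Unif}\{1,\dots,n\}}
    \bigl[ L_{\text{test}}(\hat{w}_{-I}) \bigr]
  < L_{\text{test}}(\hat{w}),
\]
% under mild conditions (for ridge: moderate \lambda, sufficiently low SNR).
```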
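
And a minimal NumPy simulation of the ridge case. Everything here is an illustrative assumption rather than the paper's setup: the mismatch model (a shifted weight vector standing in for trajectories from older policies), λ = 1.0, the noise level, and the 80/20 stale-to-fresh split. It fits ridge on the mixed data, then compares deployment-distribution test loss before and after removing one uniformly random training point, averaged over the choice of point.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def test_loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

d = 10
lam = 1.0     # moderate regularization (illustrative value)
noise = 2.0   # high noise -> low SNR (illustrative value)
w_fresh = rng.normal(size=d)            # deployment-time signal
w_stale = w_fresh + rng.normal(size=d)  # mismatched "old-policy" signal

def sample(n, w_true):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + noise * rng.normal(size=n)

X_stale, y_stale = sample(80, w_stale)  # most of the buffer is stale
X_fresh, y_fresh = sample(20, w_fresh)
X = np.vstack([X_stale, X_fresh])
y = np.concatenate([y_stale, y_fresh])
X_test, y_test = sample(5000, w_fresh)  # deployment distribution

base = test_loss(ridge_fit(X, y, lam), X_test, y_test)

# Expected deployment loss after deleting one uniformly random point:
deleted = np.mean([
    test_loss(ridge_fit(np.delete(X, i, axis=0), np.delete(y, i)),
              X_test, y_test)
    for i in range(len(y))
])
print(f"full data: {base:.4f}   one random deletion (avg): {deleted:.4f}")
```

With a mostly-stale buffer, a uniformly chosen point is usually off-distribution, which is why the averaged deletion can come out ahead; whether it actually does in any given configuration depends on the λ and SNR thresholds the paper derives.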