Unlearning Offline Stochastic Multi-Armed Bandits

arXiv cs.LG / 5/4/2026

Key Points

The paper studies machine unlearning for offline stochastic multi-armed bandits, addressing data-deletion requests and privacy risks without requiring full model retraining.
It formalizes privacy constraints for offline MAB and evaluates “utility” using the decision quality after unlearning.
The authors analyze both single- and multi-source unlearning under two data-generation regimes—the fixed-sample and distribution models—and provide algorithmic designs tailored to each.
Their methods build on two base algorithms, the Gaussian mechanism and rollback. Adaptive variants switch between the two according to the data regime and privacy constraint, and a mixing procedure clarifies the rationale behind both baselines.
The study includes theoretical guarantees (including lower bounds) and experiments that confirm the expected privacy–utility tradeoffs.

Abstract

Machine unlearning aims to unlearn data points from a learned model, offering a principled way to process data-deletion requests and mitigate privacy risks without full retraining. Prior work has mainly studied unsupervised and supervised machine unlearning, leaving unlearning for sequential decision-making systems far less understood. We initiate the study of unlearning for a foundational sequential decision-making problem: offline stochastic multi-armed bandits (MAB). We formalize the privacy constraint for offline MAB and measure utility by the post-unlearning decision quality. We conduct a systematic study of both single- and multi-source unlearning scenarios under two data-generation models, the fixed-sample model and the distribution model. For these settings, our algorithmic design is built on two canonical base algorithms, the Gaussian mechanism and rollback, and we propose adaptive algorithms that switch between them according to the data regime and privacy constraint. We further introduce a mixing procedure that elucidates the rationale behind these baselines. We provide performance guarantees across the above settings and establish lower bounds under both dataset models. Experiments validate the predicted tradeoffs and demonstrate the effectiveness of the proposed methods.
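
To make the two base algorithms concrete, here is a minimal Python sketch on a toy offline dataset. It is an illustration under stated assumptions, not the paper's method: "rollback" is read here as recomputing each arm's empirical mean from the retained samples only, and the "Gaussian mechanism" as adding Gaussian noise to the means computed on the full dataset. The function names, the `(source_id, reward)` data layout, and the noise scale `sigma` are all hypothetical. How `sigma` should be calibrated to a privacy constraint, and when to switch between the two approaches, is what the paper analyzes and is not reproduced here.

```python
import random
from statistics import fmean


def empirical_means(samples_by_arm):
    """Mean reward per arm; samples_by_arm[a] is a list of (source_id, reward)."""
    return [fmean([r for _, r in samples]) for samples in samples_by_arm]


def select_arm(means):
    """Index of the arm with the highest estimated mean (first one on ties)."""
    return max(range(len(means)), key=means.__getitem__)


def rollback_unlearn(samples_by_arm, forget_source):
    """Assumed reading of rollback: drop the forgotten source, recompute means.

    Assumes every arm keeps at least one sample after the deletion.
    """
    kept = [[(s, r) for s, r in samples if s != forget_source]
            for samples in samples_by_arm]
    return empirical_means(kept)


def gaussian_unlearn(samples_by_arm, sigma, rng):
    """Assumed reading of the Gaussian mechanism: noisy full-data means.

    sigma is a hypothetical noise scale, not calibrated to any guarantee.
    """
    return [m + rng.gauss(0.0, sigma) for m in empirical_means(samples_by_arm)]


# Toy offline dataset: two arms, two data sources (ids 0 and 1).
samples_by_arm = [
    [(0, 0.9), (0, 0.8), (1, 0.1), (1, 0.2)],
    [(0, 0.5), (1, 0.6), (0, 0.5), (1, 0.6)],
]

full_choice = select_arm(empirical_means(samples_by_arm))
rollback_choice = select_arm(rollback_unlearn(samples_by_arm, forget_source=1))
noisy_means = gaussian_unlearn(samples_by_arm, sigma=0.1, rng=random.Random(0))

print("arm chosen on full data:", full_choice)
print("arm chosen after rollback of source 1:", rollback_choice)
print("arm chosen from noisy means:", select_arm(noisy_means))
```

On this toy data the chosen arm changes once source 1 is removed, which is the kind of post-unlearning decision quality the utility measure is concerned with. The noisy variant never touches the retained data again, at the cost of a possibly worse arm choice when `sigma` is large relative to the gap between arms.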