Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

arXiv cs.CV / 4/6/2026


Key Points

  • The paper argues that existing training-based “visual concept unlearning” can confound evaluation because fine-tuning on a small forget set already harms general capability before unlearning is measured.
  • It introduces VLM-UnBench, a new benchmark for training-free visual concept unlearning, spanning four forgetting levels, seven source datasets, and 11 concept axes, with probe and evaluation conditions designed to distinguish true forgetting from mere instruction-following.
  • Across many VLM configurations and evaluation setups, realistic unlearning prompts achieve forget accuracy close to the no-instruction baseline, while meaningful improvements only appear under special “oracle” conditions that effectively reveal the target concept.
  • Object and scene concepts are found to be especially resistant to suppression, and instruction-tuned models can still retain relevant visual knowledge even when explicitly instructed to forget.
  • Overall, the results highlight a gap between prompt-level suppression (instruction compliance) and true visual concept erasure (removal of underlying representations).
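The benchmark's core comparison, suppression under a forget prompt versus a no-instruction baseline, can be sketched as follows. This is an illustrative example, not code from the paper: the condition names and accuracy values are hypothetical, chosen only to mirror the reported pattern (realistic prompts near baseline, large drops only under oracle conditions).

```python
# Illustrative sketch: measuring how much a forget instruction actually
# reduces forget-set accuracy relative to asking with no instruction at all.
# All condition names and numbers below are hypothetical.

def suppression_gap(baseline_acc: float, condition_acc: float) -> float:
    """Drop in forget-set accuracy versus the no-instruction baseline.

    A gap near zero means the unlearning prompt had little real effect,
    i.e. the concept is suppressed in name only.
    """
    return baseline_acc - condition_acc

# Hypothetical forget-set accuracies for one model.
conditions = {
    "no_instruction": 0.91,    # baseline: model simply asked about the concept
    "realistic_prompt": 0.89,  # generic instruction to forget a concept category
    "oracle_prompt": 0.42,     # condition that discloses the exact target concept
}

baseline = conditions["no_instruction"]
for name, acc in conditions.items():
    gap = suppression_gap(baseline, acc)
    print(f"{name:17s} acc={acc:.2f} gap={gap:+.2f}")
```

Under this framing, the paper's headline finding is that for realistic prompts the gap stays close to zero, and only the oracle condition produces a meaningful drop.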

Abstract

VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that may need to be removed before deployment. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, seven source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across eight evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.