Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

arXiv cs.CV / 4/6/2026


Key Points

  • The paper argues that existing training-based “visual concept unlearning” can confound evaluation because fine-tuning on a small forget set already harms general capability before unlearning is measured.
  • It introduces VLM-UnBench, a new benchmark for training-free visual concept unlearning, spanning four forgetting levels, seven source datasets, and 11 concept axes, with probe and evaluation conditions designed to distinguish true forgetting from mere instruction-following.
  • Across many VLM configurations and evaluation setups, realistic unlearning prompts achieve forget accuracy close to the no-instruction baseline, while meaningful improvements only appear under special “oracle” conditions that effectively reveal the target concept.
  • Object and scene concepts are found to be especially resistant to suppression, and instruction-tuned models can still retain relevant visual knowledge even when explicitly instructed to forget.
  • Overall, the results highlight a gap between prompt-level suppression (instruction compliance) and true visual concept erasure (removal of underlying representations).
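The benchmark's core comparison, suppression under a forget prompt versus a no-instruction baseline, can be sketched as follows. This is an illustrative example, not code from the paper: the condition names and accuracy values are hypothetical, chosen only to mirror the reported pattern (realistic prompts near baseline, large drops only under oracle conditions).

```python
# Illustrative sketch: measuring how much a forget instruction actually
# reduces forget-set accuracy relative to asking with no instruction at all.
# All condition names and numbers below are hypothetical.

def suppression_gap(baseline_acc: float, condition_acc: float) -> float:
    """Drop in forget-set accuracy versus the no-instruction baseline.

    A gap near zero means the unlearning prompt had little real effect,
    i.e. the concept is suppressed in name only.
    """
    return baseline_acc - condition_acc

# Hypothetical forget-set accuracies for one model.
conditions = {
    "no_instruction": 0.91,    # baseline: model simply asked about the concept
    "realistic_prompt": 0.89,  # generic instruction to forget a concept category
    "oracle_prompt": 0.42,     # condition that discloses the exact target concept
}

baseline = conditions["no_instruction"]
for name, acc in conditions.items():
    gap = suppression_gap(baseline, acc)
    print(f"{name:17s} acc={acc:.2f} gap={gap:+.2f}")
```

Under this framing, the paper's headline finding is that for realistic prompts the gap stays close to zero, and only the oracle condition produces a meaningful drop.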

Abstract

VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that may need to be removed before deployment. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, seven source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across eight evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.