Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks

arXiv cs.CV / 5/6/2026

📰 News · Models & Research

Key Points

  • The paper argues that existing LVLM unlearning benchmarks can yield unreliable results because they assume models have first learned the target information, whereas many models in practice fail at effective initial memorization.
  • It identifies two key causes of this “stage 1 failure,” namely under-memorization and a “multi-hop curse,” which prevent accurate diagnosis of unlearning behavior.
  • To address the problem, the authors introduce ReMem, a Reliable Multi-hop and Multi-image Memorization Benchmark designed to make foundational learning robust via principled data scaling, reasoning-aware question-answer pairs, and diverse visual contexts.
  • The work also proposes an “Exposure” metric to measure how deeply information is erased in the model’s internal probability distribution.
  • Experiments are presented showing ReMem offers a more rigorous and trustworthy framework for evaluating both learning and unlearning in large vision-language models.

Abstract

While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious identities but overlook a critical stage 1 failure: models fail to effectively memorize target information initially, rendering subsequent unlearning evaluations unreliable. Diagnosing under-memorization and the multi-hop curse as root causes, we introduce ReMem, a Reliable Multi-hop and Multi-image Memorization Benchmark. ReMem ensures robust foundational learning through principled data scaling, reasoning-aware QA pairs, and diverse visual contexts. Additionally, we propose a novel Exposure metric to quantify the depth of information erasure from the model's internal probability distribution. Extensive experiments demonstrate that ReMem provides a rigorous and trustworthy framework for diagnosing both learning and unlearning behaviors in LVLMs.
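The abstract describes an "Exposure" metric that quantifies how deeply information is erased from the model's internal probability distribution, but does not give its formula. As an illustrative sketch only (not the paper's actual definition), one plausible instantiation is the mean log-probability a model assigns to the tokens of the target fact, so that successful unlearning drives the score down:

```python
import math

def exposure(logprob_fn, target_tokens):
    """Hypothetical exposure-style score: mean log-probability the model
    assigns to the target tokens. Higher means the information is more
    'exposed' in the model's distribution. Illustrative only -- the
    paper's actual Exposure metric is not specified in this summary."""
    return sum(logprob_fn(tok) for tok in target_tokens) / len(target_tokens)

# Toy stand-in "models": token -> probability tables (hypothetical data).
memorized = {"alice": 0.9, "born": 0.8, "1987": 0.7}   # before unlearning
unlearned = {"alice": 0.05, "born": 0.1, "1987": 0.02}  # after unlearning

def make_logprob(table):
    # Unseen tokens get a small floor probability to avoid log(0).
    return lambda tok: math.log(table.get(tok, 1e-9))

target = ["alice", "born", "1987"]
before = exposure(make_logprob(memorized), target)
after = exposure(make_logprob(unlearned), target)
assert after < before  # deeper erasure lowers the exposure score
```

Under this toy reading, comparing the score before and after unlearning probes whether the fact was merely suppressed at the surface or actually down-weighted in the distribution, which matches the abstract's framing of measuring the depth of erasure.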