Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning

arXiv cs.LG / 4/7/2026



Key Points

The paper examines how large reasoning models (LRMs) used in right-to-be-forgotten workflows can develop new security vulnerabilities during machine unlearning.
It proposes a new “LRM unlearning attack” that can force incorrect final answers while still producing plausible but misleading multi-step reasoning traces.
The authors highlight key technical obstacles for the attack, including non-differentiable logical constraints, weak optimization over long rationales, and discrete selection of what data to forget.
They introduce a bi-level exact unlearning attack method that uses differentiable objectives, influential token alignment, and a relaxed forget-set indicator strategy to improve optimization.
Extensive experiments are presented across white-box and black-box scenarios to show effectiveness and generalizability, with the intent of raising awareness for LRM unlearning pipeline defenses.

Abstract

Large language models (LLMs) possess strong semantic understanding, driving significant progress in data mining applications. Large reasoning models (LRMs) extend this capability by producing explicit multi-step reasoning traces. Meanwhile, the growing need to honor the right to be forgotten has driven the development of machine unlearning techniques, which aim to eliminate the influence of specific data from trained models without full retraining. However, unlearning may also introduce new security vulnerabilities by exposing additional interaction surfaces. Although many studies have investigated unlearning attacks, none has targeted LRMs. To bridge this gap, we first propose an LRM unlearning attack that forces incorrect final answers while generating convincing but misleading reasoning traces. This objective is challenging because of non-differentiable logical constraints, weak optimization effects over long rationales, and discrete forget-set selection. To overcome these challenges, we introduce a bi-level exact unlearning attack that incorporates a differentiable objective function, influential token alignment, and a relaxed indicator strategy. To demonstrate the effectiveness and generalizability of our attack, we also design novel optimization frameworks and conduct comprehensive experiments in both white-box and black-box settings, aiming to raise awareness of the emerging threats to LRM unlearning pipelines.
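The abstract does not give the attack's formulas, so the following is only a toy sketch of the general idea behind a bi-level attack with a relaxed forget-set indicator, not the authors' method. It assumes a one-dimensional ridge regression model in place of an LRM, treats "exact unlearning" as a closed-form refit on the retained data, and replaces the binary forget/keep choice for each training point with a sigmoid score in (0, 1) so that the attacker's outer objective becomes differentiable. All function names, data values, and hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def retrain(xs, ys, keep, lam=1e-2):
    """Inner problem: exact refit of a 1-D ridge model y ~ w * x,
    where keep[i] in [0, 1] is the (relaxed) weight of sample i."""
    a = sum(k * x * x for k, x in zip(keep, xs)) + lam
    b = sum(k * x * y for k, x, y in zip(keep, xs, ys))
    return b / a, a


def attack_loss_and_grad(theta, xs, ys, x_t, y_wrong, lam=1e-2):
    """Outer problem: squared gap between the unlearned model's prediction
    at x_t and the attacker's chosen wrong answer y_wrong.
    theta holds the logits of the relaxed forget indicators s = sigmoid(theta)."""
    s = [sigmoid(t) for t in theta]
    w, a = retrain(xs, ys, [1.0 - si for si in s], lam)
    r = x_t * w - y_wrong
    # Chain rule through the closed-form refit:
    # dL/dkeep_i = 2 r x_t x_i (y_i - x_i w) / a, and keep_i = 1 - sigmoid(theta_i).
    grad = [
        -2.0 * r * x_t * x * (y - x * w) / a * si * (1.0 - si)
        for x, y, si in zip(xs, ys, s)
    ]
    return r * r, grad


def optimize_forget_scores(xs, ys, x_t, y_wrong, steps=200, lr=10.0, lam=1e-2):
    """Gradient descent on the logits; a step is accepted only if it lowers
    the loss, otherwise the step size is halved."""
    theta = [-3.0] * len(xs)  # start with almost nothing forgotten
    loss, grad = attack_loss_and_grad(theta, xs, ys, x_t, y_wrong, lam)
    for _ in range(steps):
        cand = [t - lr * g for t, g in zip(theta, grad)]
        cand_loss, cand_grad = attack_loss_and_grad(cand, xs, ys, x_t, y_wrong, lam)
        if cand_loss < loss:
            theta, loss, grad = cand, cand_loss, cand_grad
        else:
            lr *= 0.5
    return [sigmoid(t) for t in theta], loss


# Hypothetical toy data that roughly follows y = 2x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
scores, final_loss = optimize_forget_scores(xs, ys, x_t=2.0, y_wrong=5.0)
print("relaxed forget scores:", [round(s, 3) for s in scores])
print("final attack loss:", final_loss)
```

In this sketch, samples whose score ends up close to 1 are the ones an attacker would request to be forgotten; turning the relaxed scores into an actual forget set would need a thresholding or top-k step, which is left out here. The reasoning-trace and token-alignment components described in the abstract have no counterpart in this toy model.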



