Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning
arXiv cs.LG / 4/7/2026
Key Points
- The paper examines how large reasoning models (LRMs) used in right-to-be-forgotten workflows can develop new security vulnerabilities during machine unlearning.
- It proposes a new “LRM unlearning attack” that forces the unlearned model into incorrect final answers while it still emits plausible-looking but misleading multi-step reasoning traces.
- The authors highlight key technical obstacles for such an attack: non-differentiable logical constraints, a weak optimization signal over long rationales, and the discrete selection of which data to forget.
- They introduce a bi-level exact unlearning attack that uses differentiable objectives, influential token alignment, and a relaxed forget-set indicator (see the sketch after this list) to make these obstacles tractable for gradient-based optimization.
- Extensive experiments across both white-box and black-box scenarios demonstrate the attack's effectiveness and generalizability, with the stated goal of raising awareness of the need to defend LRM unlearning pipelines.
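
The relaxed forget-set indicator is the piece that turns the discrete "what to forget" choice into something an attacker can optimize. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the paper's actual method: a sigmoid-relaxed mask over candidate forget examples is tuned so that, after a simulated gradient-ascent unlearning step, a toy model flips its answer on a target query. All names here (the soft mask, the one-step inner "unlearning", the attacker's `wrong_y` label) are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the model being unlearned: a 2-D linear classifier.
W = torch.randn(2, 2, requires_grad=True)

forget_x = torch.randn(8, 2)                      # candidate forget examples
forget_y = torch.randint(0, 2, (8,))
target_x = torch.randn(1, 2)                      # query the attacker cares about
wrong_y = torch.tensor([1])                       # attacker's desired incorrect answer

mask_logits = torch.zeros(8, requires_grad=True)  # relaxed forget-set indicator
opt = torch.optim.Adam([mask_logits], lr=0.1)
ce = torch.nn.functional.cross_entropy

for _ in range(200):
    # Soft selection in (0, 1) instead of a discrete {0, 1} indicator.
    mask = torch.sigmoid(mask_logits / 0.5)

    # Inner level: simulate unlearning as one gradient-ASCENT step that
    # raises the loss on the softly selected forget examples.
    per_example = ce(forget_x @ W, forget_y, reduction="none")
    (g,) = torch.autograd.grad((mask * per_example).sum(), W, create_graph=True)
    W_unlearned = W + 0.1 * g

    # Outer level: push the *unlearned* model toward the wrong answer;
    # gradients flow back through the inner step into the mask.
    attack_loss = ce(target_x @ W_unlearned, wrong_y)
    opt.zero_grad()
    attack_loss.backward()
    opt.step()

print("learned soft forget mask:", torch.sigmoid(mask_logits / 0.5))
```

In the real bi-level setting, the inner step would be a full unlearning run on an LRM and the outer objective would also have to score long reasoning traces; the sketch only illustrates why relaxing the indicator makes the outer selection problem differentiable at all.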