Explainable LLM Unlearning Through Reasoning
arXiv cs.AI / 3/12/2026
Key Points
- The paper argues that prior unlearning methods such as gradient ascent (GA) are untargeted: they can degrade a model's general abilities or fail to fully remove the targeted knowledge. It therefore introduces a reasoning-based unlearning target that specifies both what should be forgotten and how the model should respond after unlearning.
- Building on this target, it proposes targeted reasoning unlearning (TRU), which uses the reasoning-based target as guidance and combines a supervised cross-entropy loss with a GA-based loss, enabling precise knowledge removal while preserving unrelated abilities.
- The authors evaluate TRU across multiple benchmarks and LLM backbones, showing more reliable unlearning and preserved general capabilities, along with increased robustness under diverse attack scenarios.
- They present reasoning-augmented unlearning as a practical, explainable paradigm for safe, reliable LLM unlearning, with implications for safety, copyright, and privacy concerns.
