Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem

arXiv cs.CL · April 17, 2026

📰 News · Models & Research

Key Points

  • The paper reframes LLM “unlearning” as an asymmetric two-task learning setup where retaining general capability is the primary objective and forgetting targeted knowledge is an auxiliary objective.
  • It proposes a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from a conflict-aware method for combining gradients.
  • Using this framework, the authors adapt PCGrad for conflict resolution and introduce SAGO, a new retention-prioritized gradient synthesis method based on constructive sign-constrained synthesis.
  • Theoretical analysis shows both methods maintain non-negative cosine similarity with the retain gradient, while SAGO provides strictly tighter alignment.
  • Experiments on WMDP Bio/Cyber and RWKU demonstrate improved Pareto-optimal trade-offs, with WMDP Bio SimNPO+GD target-model MMLU recovery rising from 44.6% (naive) to 94.0% (+PCGrad) and 96.0% (+SAGO) while preserving comparable forgetting strength.

Abstract

Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is auxiliary. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt the established PCGrad method to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on the WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target-model MMLU performance rises from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.
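To make the retention-prioritized idea concrete, here is a minimal sketch of the PCGrad-style variant the paper adapts: when the forget gradient conflicts with the retain gradient (negative inner product), its conflicting component is projected out before the two are summed, so the synthesized update never opposes retention. This is an illustrative reconstruction from the summary, not the authors' code; SAGO's sign-constrained synthesis is not specified here, and the function names and single-vector gradients are assumptions.

```python
import numpy as np

def retention_prioritized_pcgrad(g_retain: np.ndarray, g_forget: np.ndarray) -> np.ndarray:
    """Combine retain/forget gradients with retention as the primary task.

    If g_forget conflicts with g_retain (dot product < 0), remove its
    component along g_retain (PCGrad-style projection), then add it to
    the untouched retain gradient. The result is guaranteed to have
    non-negative cosine similarity with g_retain.
    """
    dot = g_forget @ g_retain
    if dot < 0:  # conflict: forget direction would hurt retention
        g_forget = g_forget - (dot / (g_retain @ g_retain)) * g_retain
    return g_retain + g_forget

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In the conflicting case the projected forget gradient is orthogonal to `g_retain`, so the combined update satisfies `cos(g, g_retain) = ||g_retain|| / ||g|| > 0`; in the non-conflicting case the dot product is already non-negative, which is the non-negative-alignment guarantee the paper proves for both variants.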