Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation

arXiv cs.AI / 4/20/2026

Key Points

  • The paper addresses practical LLM unlearning as a multi-objective problem, requiring removal of harmful or privacy-leaking knowledge while also maintaining general utility, reducing over-refusal, and improving robustness to adversarial probing.
  • It argues that prior methods typically cover only a subset of these objectives, and that naive multi-objective extensions can cause interference between unlearning tasks.
  • The proposed approach harmonizes the objectives through a co-design of data and optimization. On the data side, it standardizes the training corpora into a unified representation to reduce the domain gap between tasks.
  • On the optimization side, it introduces a bidirectional logit distillation that elicits desired behavior from a context-instructed teacher while suppressing undesirable behavior in the student.
  • The authors report theoretical and empirical evidence that the method aligns domain distributions and improves cooperative optimization, achieving state-of-the-art balanced and reliable unlearning performance.
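
The "interference" mentioned in the key points can be made concrete with a common diagnostic from multi-task learning: the cosine similarity between the gradients of two objectives. A negative value means a step that helps one objective hurts the other. The sketch below is only an illustration of that idea. The gradient vectors and their names (`g_forget`, `g_retain`, `g_neighbor`) are made-up toy values, and the summary does not say that the paper measures interference this way.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy gradients for three objectives (not data from the paper).
g_forget = [1.0, -2.0, 0.5]    # push the model away from the forget set
g_retain = [-1.0, 1.5, 0.0]    # keep general utility
g_neighbor = [0.8, -1.0, 1.0]  # avoid over-refusal on neighboring concepts

conflict = cosine(g_forget, g_retain)       # negative: the tasks interfere
cooperate = cosine(g_forget, g_neighbor)    # positive: the tasks cooperate

print("forget vs retain   :", round(conflict, 3))
print("forget vs neighbor :", round(cooperate, 3))
```

In this reading, "converting tasks into cooperative optimization" would correspond to moving such pairwise similarities from negative toward positive.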

Abstract

Large language model (LLM) unlearning is crucial for removing hazardous or privacy-leaking information from a model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robustness against adversarial probing attacks. However, existing unlearning methods primarily focus on a limited subset of these goals, typically unlearning efficacy and utility preservation, while overlooking robustness and boundary behaviors. Naively extending these methods to multi-objective settings may lead to interference between unlearning tasks. We propose a novel multi-objective unlearning framework that harmonizes multiple unlearning objectives through a data and optimization co-design: we standardize training corpora into a unified data representation to reduce the domain gap, and then introduce a bidirectional distillation method that elicits desired behavior from a context-instructed teacher while simultaneously suppressing undesirable behavior in the student model. Theoretical and empirical analyses show that our method aligns domain distributions and converts seemingly unrelated unlearning tasks into cooperative optimization. Evaluations demonstrate state-of-the-art performance, enabling balanced and reliable unlearning across diverse, challenging requirements.
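
The abstract does not give the loss itself, so the following is only one plausible reading of "bidirectional" logit distillation, written as a minimal sketch. It assumes a "pull" term (KL divergence from a teacher's next-token distribution to the student's) and a "push" term that penalizes student probability on tokens an undesired reference favors more than the teacher does. The function names, the form of the push term, the weight `lam`, and the toy logits are all assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bidirectional_loss(student_logits, teacher_logits, undesired_logits, lam=1.0):
    """Hypothetical two-sided objective (an assumption, not the paper's loss).

    pull: match the context-instructed teacher's distribution.
    push: penalize student mass on tokens the undesired reference
          prefers more strongly than the teacher does.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    p_u = softmax(undesired_logits)
    pull = kl(p_t, p_s)
    push = sum(ps * max(0.0, pu - pt) for ps, pt, pu in zip(p_s, p_t, p_u))
    return pull + lam * push

# Toy 3-token vocabulary: token 0 leaks the fact, token 2 is a safe answer.
teacher = [0.0, 0.0, 3.0]     # teacher (instructed in context) prefers token 2
undesired = [3.0, 0.0, 0.0]   # un-instructed behavior prefers token 0

loss_leaky = bidirectional_loss(undesired, teacher, undesired)
loss_safe = bidirectional_loss(teacher, teacher, undesired)
print("student imitates undesired behavior:", round(loss_leaky, 3))
print("student imitates teacher           :", round(loss_safe, 3))
```

Under these assumptions, a student that copies the teacher receives a much lower loss than one that reproduces the undesired behavior, which is the qualitative effect the abstract describes.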