Representation-Guided Parameter-Efficient LLM Unlearning

arXiv cs.CL / 4/21/2026


Key Points

  • The paper addresses “machine unlearning” for LLMs, focusing on the difficult forget–retain trade-off faced by existing parameter-efficient methods.
  • It argues that current unlearning approaches are limited because parameter-importance metrics cannot reliably separate parameters tied to the forget set from those tied to the retain set, owing to superposition and polysemanticity in LLM representations.
  • The proposed method, REGLU, uses representation-space geometry to guide a LoRA-based unlearning process with (1) a representation-guided initialization to pick an optimal forgetting subspace and (2) a regularization loss that pushes the LoRA update into the orthogonal complement of the retain-set subspace.
  • Experiments on the TOFU and WMDP benchmarks across multiple models show that REGLU achieves better unlearning quality than prior approaches while preserving higher overall model utility.
  • The work is positioned as a robust and precise parameter-efficient unlearning technique that could improve how organizations remove sensitive or harmful content from deployed LLMs.
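The orthogonal-complement constraint described above can be illustrated with a small NumPy sketch. This is not the paper's implementation; the subspace dimension, the PCA-style basis choice, and the squared-norm penalty are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k = 16, 4, 3  # hidden size, LoRA rank, retain-subspace dimension (all hypothetical)

# Retain-set hidden states (rows = examples); the top-k left singular
# vectors of their transpose span an estimated retain representation subspace.
H_retain = rng.standard_normal((100, d))
U = np.linalg.svd(H_retain.T, full_matrices=False)[0][:, :k]  # (d, k) orthonormal basis

# Hypothetical LoRA factors: the low-rank update applied to input x is B @ (A @ x).
A = rng.standard_normal((r, d)) * 0.1
B = rng.standard_normal((d, r)) * 0.1

x = rng.standard_normal(d)
delta = B @ (A @ x)  # LoRA contribution to the hidden state

# Soft version: a regularization loss penalizing the component of the
# LoRA output that falls inside the retain subspace (columns of U).
reg_loss = float(np.sum((U.T @ delta) ** 2))

# Hard version: project the LoRA output onto the orthogonal complement,
# removing the retain-subspace component entirely.
delta_orth = delta - U @ (U.T @ delta)
```

Driving `reg_loss` to zero during training pushes the LoRA update's outputs into the orthogonal complement of the retain subspace, which is the geometric intuition behind the second component of REGLU.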

Abstract

Large Language Models (LLMs) often memorize sensitive or harmful information, necessitating effective machine unlearning techniques. While existing parameter-efficient unlearning methods have shown promise, they still struggle with the forget-retain trade-off. This can be attributed to their reliance on parameter importance metrics to identify parameters that are important exclusively for the forget set, which is fundamentally limited by the superposition phenomenon. Due to the polysemantic nature of LLM parameters, such an importance metric may struggle to disentangle parameters associated with the forget and retain sets. In this work, we propose Representation-Guided Low-rank Unlearning (REGLU), a novel approach that leverages the geometric properties of representation spaces to achieve robust and precise unlearning. First, we develop a representation-guided initialization for LoRA that identifies the optimal subspace for selective forgetting. Second, we introduce a regularization loss that constrains the outputs of the LoRA update to lie in the orthogonal complement of the retain set's representation subspace, thereby minimizing interference with the model's performance on the retain set. We evaluate REGLU on the TOFU and WMDP benchmarks across multiple models. Our results demonstrate that REGLU consistently outperforms state-of-the-art baselines, achieving superior unlearning quality while maintaining higher model utility.
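The first component, representation-guided LoRA initialization, can also be sketched. The abstract does not specify how the forgetting subspace is chosen, so the SVD/PCA choice below is an assumption; the zero-initialized `A` factor follows standard LoRA practice of starting the update as a no-op.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 4  # hidden size and LoRA rank (hypothetical values)

# Forget-set hidden states; their top-r principal directions serve as a
# candidate "forgetting subspace" (assumption: an SVD/PCA-style choice).
H_forget = rng.standard_normal((200, d))
U_f = np.linalg.svd(H_forget.T, full_matrices=False)[0]
V_forget = U_f[:, :r]  # (d, r) top-r directions of the forget representations

# Representation-guided init (hypothetical): point the up-projection B at
# the forgetting subspace, and zero-init A so the initial update is a no-op.
B0 = V_forget           # (d, r), orthonormal columns
A0 = np.zeros((r, d))   # (r, d)

x = rng.standard_normal(d)
initial_update = B0 @ (A0 @ x)  # zero vector: the model is unchanged at step 0
```

Under this sketch, gradient updates to `A0` can only move hidden states along the pre-selected forgetting directions in `B0`, which is one plausible reading of "identifying the optimal subspace for selective forgetting."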