DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation
arXiv cs.CL / 2026/3/25
Key Points
- The paper shows that safety-aligned LLMs are vulnerable to backdoor attacks implanted via model editing, but that many existing editing methods are unstable due to "safety fallback": the model begins with an affirmative response and then reverts to a refusal mid-generation.
- It proposes DualEdit, a dual-objective editing framework that simultaneously promotes affirmative tokens and suppresses refusal tokens during generation.
- To address the resulting optimization and generalization difficulties, DualEdit uses dynamic loss weighting to balance the two objectives and value anchoring to reduce conflicts among diverse refusal and affirmation tokens.
- In experiments on safety-aligned LLMs, DualEdit is reported to raise the attack success rate by about 10% and lower the safety-fallback rate by about 11% relative to baseline methods.
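The dual-objective idea above can be illustrated with a toy loss. This is a hedged sketch, not the paper's actual implementation: the function names, the exact form of the two terms, and the dynamic-weighting rule (upweighting whichever objective currently lags) are all assumptions; the paper's value-anchoring component is omitted for brevity.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def dual_edit_loss(logits, affirm_ids, refuse_ids, eps=1e-9):
    """Toy dual-objective loss (assumed form, for illustration only):
    promote affirmative-token probability mass while suppressing
    refusal-token mass, balanced by a dynamic weight."""
    p = softmax(np.asarray(logits, dtype=float))
    p_affirm = p[affirm_ids].sum()
    p_refuse = p[refuse_ids].sum()
    # Promotion term: push probability mass onto affirmative tokens.
    l_affirm = -np.log(p_affirm + eps)
    # Suppression term: push probability mass off refusal tokens.
    l_refuse = -np.log(1.0 - p_refuse + eps)
    # Dynamic weighting (assumption): the objective with the larger
    # residual loss receives the larger weight.
    w = l_affirm / (l_affirm + l_refuse + eps)
    return w * l_affirm + (1.0 - w) * l_refuse

# A next-token distribution that favors the affirmative token (index 0)
# over the refusal token (index 1) should incur a lower loss.
loss_good = dual_edit_loss([5.0, 0.0, 0.0], affirm_ids=[0], refuse_ids=[1])
loss_bad = dual_edit_loss([0.0, 5.0, 0.0], affirm_ids=[0], refuse_ids=[1])
```

The single scalar weight `w` is a stand-in for whatever balancing schedule the paper actually uses; the point is only that a lone promotion term is insufficient, because refusal tokens can retain enough probability to trigger the fallback.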
