The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

arXiv cs.CL / 4/10/2026


Key Points

  • The paper studies how common LLM fine-tuning approaches can be used both to create safety “misalignment” and to subsequently “realign” models after post-training, addressing adversarial abuse risks.
  • Across multiple safety-aligned LLMs and a set of four SFT and two PFT methods, the authors find an asymmetry: ORPO is most effective for misalignment attacks, while DPO is best for realignment.
  • The realignment improvements from DPO can come with a trade-off in overall model utility, highlighting performance-safety balance issues.
  • Results also show model-specific resistance and residual effects from multi-round adversarial dynamics, implying defenses may need to be tailored and robust over iterative interactions.
  • The work concludes that deploying untrusted third-party LLMs requires additional safeguards and customized safety alignment strategies, and it provides accompanying code for experimentation.

Abstract

The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in “misalignment”. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as “realignment”, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The-Art-of-Mis-alignment.