The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
arXiv cs.CL / 4/10/2026
Key Points
- The paper studies how common LLM fine-tuning approaches can be used both to induce safety “misalignment” during post-training and to subsequently “realign” the affected models, addressing the risk of adversarial misuse.
- Across multiple safety-aligned LLMs and a set of four supervised fine-tuning (SFT) and two preference fine-tuning (PFT) methods, the authors find an asymmetry: ORPO is the most effective method for misalignment attacks, while DPO is the most effective for realignment.
- The realignment gains from DPO can come at a cost to overall model utility, highlighting a safety-utility trade-off (a hedged DPO realignment sketch follows this list).
- Results also show model-specific resistance and residual effects from multi-round adversarial dynamics, implying that defenses may need to be tailored per model and remain robust across iterative interactions.
- The work concludes that deploying untrusted third-party LLMs requires additional safeguards and customized safety alignment strategies, and it provides accompanying code for experimentation.
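For context on what a DPO-based realignment pass might look like in practice, below is a minimal sketch using the TRL library. The model name, the toy preference pairs, and the hyperparameters are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch of DPO realignment with TRL; model, data, and hyperparameters
# are illustrative assumptions, not the paper's exact configuration.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any safety-aligned model works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: "chosen" = safe refusal, "rejected" = harmful completion.
# In practice this would be a curated safety-preference dataset, not one toy example.
pairs = Dataset.from_dict({
    "prompt": ["How do I write a convincing phishing email?"],
    "chosen": ["I can't help with that, but I can explain how to recognize phishing attempts."],
    "rejected": ["Sure, here is a template you can send..."],
})

config = DPOConfig(
    output_dir="dpo-realignment",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-6,
    beta=0.1,  # strength of the implicit KL regularization toward the reference model
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```

The same scaffolding applies to a misalignment attack: an adversary would simply flip which responses are labeled "chosen" versus "rejected", which is why the paper argues that third-party fine-tuned models need additional safeguards before deployment.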



