Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

arXiv cs.LG / 4/14/2026


Key Points

  • The paper proposes Latent Instruction Representation Alignment (LIRA) to defend LLMs against jailbreaks, backdoors, and undesired knowledge by changing how the model interprets instructions rather than only learning from action outcomes.
  • By aligning latent representations of instructions rather than only supervising outputs, LIRA achieves better cross-scenario generalization, which the authors further improve with an internally adversarial training algorithm.
  • The reported results include blocking more than 99% of PEZ jailbreak attacks, removing an insecure code backdoor, and achieving “optimal forgetting” on WMDP cyber with negligible degradation of benign capabilities.
  • The work positions instruction-representation alignment as a more generalizable alternative to prior approaches that primarily condition training on model behavior under malicious prompts.

Abstract

We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
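The abstract does not spell out the training objective, but the core idea of aligning how a model represents an instruction (rather than only penalizing its outputs) can be illustrated with a toy sketch. Everything below is an assumption for illustration: the linear "encoder" `W`, the squared-error alignment loss, and the notion of a precomputed "safe interpretation" latent `z_safe` are stand-ins, not the paper's implementation.

```python
import numpy as np

# Toy sketch (NOT the paper's method): align the latent representation of a
# malign instruction with the latent of its "safe interpretation", instead of
# supervising only the output behavior. The encoder and loss are assumptions.

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) * 0.1         # toy encoder parameters

def encode(W, x):
    """Latent representation of instruction embedding x."""
    return W @ x

def alignment_loss(W, x_malign, z_safe):
    """Squared distance between the malign-instruction latent and the target."""
    diff = encode(W, x_malign) - z_safe
    return float(diff @ diff)

x_malign = rng.normal(size=d)             # embedding of a jailbreak-style prompt
z_safe = rng.normal(size=d)               # latent a safe reading would produce

lr = 0.01
losses = []
for _ in range(200):
    diff = encode(W, x_malign) - z_safe
    grad = 2.0 * np.outer(diff, x_malign)  # gradient of the loss w.r.t. W
    W -= lr * grad                         # pull the latent toward z_safe
    losses.append(alignment_loss(W, x_malign, z_safe))

print(losses[0], losses[-1])               # alignment loss shrinks over training
```

In this caricature, training moves the model's internal reading of the harmful prompt toward a benign interpretation, so the defense does not depend on having seen that particular attack's surface form; the paper's internally adversarial variant would additionally search for instructions where this alignment fails and train on those.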