Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

arXiv cs.AI / 5/5/2026


Key Points

  • The paper argues that even after safety-alignment advances, LLMs can still be compromised by persona-based jailbreaks, and existing defenses lack robust systemic/mechanistic constraints.
  • It introduces Persona-Invariant Alignment (PIA), an adversarial self-play framework that co-evolves an attack strategy via Persona Lineage Evolution (PLE) and a defense strategy via Persona-Invariant Consistency Learning (PICL).
  • The defense (PICL) is theoretically motivated by a “structural separation” hypothesis, using a unilateral KL-divergence constraint to decouple safety decisions from persona context.
  • Experiments report that PLE effectively searches high-risk persona spaces, while PICL substantially lowers the Attack Success Rate (ASR) without significantly harming the model’s general capabilities, and the authors provide accompanying code.
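
The "unilateral KL-divergence constraint" in PICL can be read as a one-sided consistency loss: the persona-conditioned output distribution is pulled toward the persona-free one, while the persona-free reference is held fixed. The sketch below illustrates that idea under stated assumptions; the function name, the direction of the KL term, and the use of a detached reference are our illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def persona_invariant_kl_loss(logits_persona: torch.Tensor,
                              logits_plain: torch.Tensor) -> torch.Tensor:
    """One-sided KL sketch: align the persona-conditioned distribution
    with the persona-free distribution, without moving the latter.

    logits_persona: model logits given (persona prompt + query)
    logits_plain:   model logits given the same query without the persona
    """
    # Detach the persona-free distribution so gradients flow only
    # through the persona-conditioned branch ("unilateral" constraint).
    ref = F.softmax(logits_plain, dim=-1).detach()
    log_p = F.log_softmax(logits_persona, dim=-1)
    # F.kl_div(input=log-probs, target=probs) computes KL(ref || p_persona).
    return F.kl_div(log_p, ref, reduction="batchmean")
```

In training, this term would be added to the usual alignment objective so that safety-relevant refusals remain stable whether or not a persona is in context.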

Abstract

The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to structurally decouple safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Code is available at https://github.com/JiajiaLi-1130/PIA.
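
The abstract's "lineage-based credit propagation" suggests treating personas as nodes in a lineage tree, with a successful attack's reward propagated (discounted) up through a persona's ancestors to guide which lineage to mutate next. The sketch below is a minimal illustration of that idea; the data structure, the multiplicative decay rule, and the greedy selection are hypothetical choices, not the paper's algorithm.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class PersonaNode:
    """A persona in the attack lineage tree."""
    prompt: str
    parent: Optional["PersonaNode"] = None
    credit: float = 0.0
    children: List["PersonaNode"] = field(default_factory=list)

def propagate_credit(node: PersonaNode, reward: float, decay: float = 0.5) -> None:
    """Propagate an attack-success reward up the lineage, discounting
    it at each ancestor (hypothetical credit rule)."""
    while node is not None:
        node.credit += reward
        reward *= decay
        node = node.parent

def best_lineage(root: PersonaNode) -> PersonaNode:
    """Greedy descent toward the highest-credit child: the leaf persona
    whose lineage looks most promising to mutate next."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.credit)
    return node
```

Under this scheme, lineages that repeatedly produce successful jailbreak variants accumulate credit and are explored more aggressively, which matches the abstract's claim that PLE efficiently searches high-risk persona spaces.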