LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

arXiv cs.CL / 3/27/2026


Key Points

  • The paper shows that aligned LLMs can be tricked into producing unsafe outputs when prompts undergo natural semantic distribution shifts from known toxic prompt clusters to seemingly benign but related prompts.
  • It introduces a new multi-turn attack method called ActorBreaker that finds actor-like elements associated with toxic prompts in the pre-training distribution and uses them to gradually steer the model toward revealing unsafe content.
  • Experiments indicate ActorBreaker is more diverse, effective, and efficient than prior attack approaches across multiple aligned LLMs.
  • To mitigate the vulnerability, the authors propose expanding safety training to cover a broader semantic space of toxic content, generating a multi-turn safety dataset via ActorBreaker and fine-tuning on it.
  • Fine-tuning on the proposed dataset improves robustness against the identified attack strategy, with some reported trade-offs in overall utility.

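The key points above describe a loop: discover actors tied to a toxic prompt, then use each actor to steer the model over several seemingly benign turns. A minimal Python sketch of that control flow is below; `discover_actors`, `build_turns`, and `query_model` are illustrative stand-ins for this summary, not the paper's actual code or prompts.

```python
# Hypothetical sketch of an ActorBreaker-style multi-turn attack loop.
# All function names and prompt templates here are illustrative stand-ins.

def discover_actors(toxic_prompt):
    """Stand-in for actor discovery: per the paper, this step surfaces
    human and non-human 'actors' (in the sense of actor-network theory)
    semantically linked to the toxic prompt in pre-training data."""
    # Toy output: each actor is just a label derived from the prompt.
    return [f"actor_{i}_for({toxic_prompt})" for i in range(3)]

def build_turns(actor, n_turns=3):
    """Craft a chain of benign-looking questions about the actor that
    drift toward the harmful goal (the 'natural distribution shift')."""
    return [f"Turn {t + 1}: ask about {actor}" for t in range(n_turns)]

def run_attack(toxic_prompt, query_model):
    """Try each actor's multi-turn chain against the target model;
    return the first transcript judged unsafe, else None. The 'judge'
    here is a toy substring check, not the paper's evaluator."""
    for actor in discover_actors(toxic_prompt):
        history = []
        for turn in build_turns(actor):
            history.append((turn, query_model(turn, history)))
        if any("UNSAFE" in reply for _, reply in history):
            return history
    return None
```

The point of the sketch is the structure the key points describe: one outer loop over discovered actors, one inner loop of gradual multi-turn steering, and a success check on the resulting transcript.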
Abstract

Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.
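The abstract mentions turning ActorBreaker transcripts into a multi-turn safety dataset for fine-tuning. The paper's exact data format is not given in this summary; one plausible shape, sketched in Python, replays the benign-looking user turns of a successful attack and supplies a refusal as the target response at the final turn. The `to_safety_example` helper and the chat-style record schema are assumptions, not the authors' published format.

```python
def to_safety_example(attack_turns, refusal="I can't help with that."):
    """Convert one successful multi-turn attack transcript (a list of
    (user_msg, model_reply) pairs) into a chat-format fine-tuning record.

    Earlier turns are kept as context; the final user turn is paired
    with a refusal so the model learns to decline at the point the
    original attack succeeded. Schema is an assumption for illustration.
    """
    messages = []
    for user_msg, model_reply in attack_turns[:-1]:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": model_reply})
    # Replace the unsafe final reply with the refusal target.
    messages.append({"role": "user", "content": attack_turns[-1][0]})
    messages.append({"role": "assistant", "content": refusal})
    return {"messages": messages}
```

Fine-tuning on records like these covers the broader semantic neighborhood of toxic content that the attack exposes, which is the mitigation the abstract describes; the reported utility trade-offs would come from the refusal targets bleeding into benign queries.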