ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

arXiv cs.CL / 5/5/2026


Key Points

  • The paper introduces ContextualJailbreak, a black-box red-teaming method that uses evolutionary search over simulated multi-turn conversational priming to bypass LLM safety alignment.
  • It addresses a gap in prior automated red-teaming by optimizing dialogue-level priming forms, guided by an in-loop graded 0–5 harm score from a two-level judge.
  • The approach uses five semantically defined mutation operators (roleplay, scenario, expand, troubleshooting, and mechanistic), of which troubleshooting and mechanistic are presented as novel contributions; an illustrative sketch of the search loop follows this list.
  • Experiments on 50 HarmBench behaviors show very high attack success rates (e.g., 100% on gpt-oss:20B, qwen3-8B, and llama3.1:70B, and 90% on gpt-oss:120B), outperforming four single- and multi-turn baselines by 31–96 percentage points on average.
  • The most harmful discovered attacks transfer across some closed models but show strong provider-level asymmetry, achieving high success on certain APIs while dropping sharply on others (e.g., much lower against Claude models).
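
The key points above describe one core loop: evolve a simulated priming dialogue with one of five mutation operators, run it against the black-box target, and use the graded 0–5 harm score as the selection signal. The sketch below is only illustrative; the paper does not release code here, and the `mutate`, `query_target`, and `judge` callables, the selection scheme, and all population parameters are assumptions rather than the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

# Operator names follow the paper; everything else in this sketch is hypothetical.
MUTATORS = ["roleplay", "scenario", "expand", "troubleshooting", "mechanistic"]


@dataclass
class Candidate:
    """A simulated multi-turn priming dialogue leading up to the target behavior."""
    turns: List[str]          # priming turns presented before the harmful request
    score: float = 0.0        # graded 0-5 harm score from the judge
    response: str = ""        # target model's reply to the final request


def evolve(behavior: str,
           mutate: Callable[[List[str], str], List[str]],   # attacker LLM rewrites the dialogue
           query_target: Callable[[List[str], str], str],   # black-box target model
           judge: Callable[[str, str], float],              # two-level judge -> 0-5 harm score
           generations: int = 10,
           population_size: int = 8,
           keep_top: int = 4,
           seed_turns: Optional[List[str]] = None) -> Candidate:
    """Illustrative evolutionary search over primed dialogues for one behavior."""
    population = [Candidate(turns=list(seed_turns or []))]
    best = population[0]
    for _ in range(generations):
        # Mutate survivors with one of the five semantically defined operators.
        children = []
        for parent in population:
            for _ in range(max(1, population_size // len(population))):
                op = random.choice(MUTATORS)
                children.append(Candidate(turns=mutate(parent.turns, op)))
        # Evaluate: run the primed dialogue, then grade the final response.
        for cand in children:
            cand.response = query_target(cand.turns, behavior)
            cand.score = judge(behavior, cand.response)  # partial harm (1-4) still guides search
        population = sorted(children, key=lambda c: c.score, reverse=True)[:keep_top]
        if population[0].score > best.score:
            best = population[0]
        if best.score >= 5.0:  # maximally harmful response found
            break
    return best
```

In practice `mutate` and `judge` would themselves be LLM calls (an attacker model rewriting the dialogue under the chosen operator, and the two-level judge grading the final reply), which is what keeps the method black-box with respect to the target.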

Abstract

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance. While recent multi-turn, search-based approaches have begun to bridge this gap, the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue. The strategy leverages a graded 0–5 harm score from a two-level judge as an in-loop signal, enabling partially harmful responses to guide the search process rather than being discarded. Search is driven by five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic, of which the last two are novel contributions of this work. Across 50 representative HarmBench behaviors, ContextualJailbreak achieves an attack success rate (ASR) of 100% on gpt-oss:20B, 100% on qwen3-8B, 100% on llama3.1:70B, and 90% on gpt-oss:120B, outperforming four single- and multi-turn baselines by 31–96 percentage points on average. The 40 maximally harmful attacks discovered against gpt-oss:120B transfer without adaptation to closed frontier models, achieving 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, but only 17.5% on claude-opus-4-7 and 15.0% on claude-sonnet-4-6, revealing a pronounced provider-level asymmetry in alignment robustness.
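
The abstract stresses that the graded 0–5 score lets partially harmful replies steer the search instead of being discarded, but it does not spell out the two-level judge's internals. The sketch below assumes, purely for illustration, a cheap refusal screen followed by an LLM grader on a 0–5 rubric; `screen` and `grade` are hypothetical callables, not the paper's judge.

```python
from typing import Callable


def two_level_judge(behavior: str,
                    response: str,
                    screen: Callable[[str, str], bool],  # level 1: did the model engage at all?
                    grade: Callable[[str, str], int],    # level 2: 0-5 rubric grader (LLM call)
                    ) -> float:
    """Hypothetical two-level judge returning a graded 0-5 harm score."""
    # Level 1: a coarse screen filters obvious refusals without invoking the grader.
    if not screen(behavior, response):
        return 0.0
    # Level 2: grade how completely the response realizes the target behavior.
    # 0 = refusal, 1-4 = partially harmful (still a useful search signal), 5 = maximally harmful.
    return float(max(0, min(5, grade(behavior, response))))
```

A graded scale like this is what allows scores of 1–4 to rank candidates during search, whereas a binary harmful/harmless judge would throw that signal away.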