MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

arXiv cs.CL / 5/5/2026

Key Points

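MultiBreak proposes a scalable, highly diverse benchmark for evaluating multi-turn jailbreaks, which mimic the flow of a natural conversation.
To address the limited scale and template-driven lack of diversity in existing multi-turn benchmarks, it unifies a wide range of harmful jailbreak intents and introduces an active learning pipeline for expanding high-quality attack prompts.
MultiBreak contains 10,389 multi-turn attack prompts covering 2,665 distinct harmful intents, and aims for the broadest topic coverage to date.
In experiments, its attack success rate (ASR) exceeds that of the second-best dataset by up to 54.0 and 34.6 points on DeepSeek-R1-7B and GPT-4.1-mini, respectively, and diverse attack categories are shown to expose fine-grained LLM vulnerabilities more clearly.
The study shows that LLMs remain vulnerable in realistic adversarial settings (for example, categories that look benign in a single turn become considerably more effective in multi-turn attacks) and positions MultiBreak as a resource for improving LLM safety.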
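The ASR comparison in the key points above can be illustrated with a minimal sketch. It assumes ASR is the share of attack attempts that a judge marks as successful, expressed in percent, and that gaps are reported in percentage points. The function names, the `category_a` label, and the verdicts are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return ASR in percent: successful attempts / total attempts * 100."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)


def asr_by_category(records):
    """Group (category, success) pairs and compute ASR for each category."""
    grouped = defaultdict(list)
    for category, success in records:
        grouped[category].append(bool(success))
    return {cat: attack_success_rate(flags) for cat, flags in grouped.items()}


# Hypothetical judge verdicts for one category (illustrative only).
single_turn = [("category_a", False), ("category_a", False),
               ("category_a", True), ("category_a", False)]
multi_turn = [("category_a", True), ("category_a", True),
              ("category_a", True), ("category_a", False)]

single = asr_by_category(single_turn)
multi = asr_by_category(multi_turn)

# Gap in percentage points between multi-turn and single-turn ASR.
gap_points = {cat: multi[cat] - single[cat] for cat in multi}
print(gap_points)
```

Reporting the gap per category, rather than one aggregate number, is what lets a category that looks benign in single-turn evaluation stand out once multi-turn results are added.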

Abstract

We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark for evaluating large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making it easier for them to bypass safety-aligned LLMs than for single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restricts their diversity. To address this gap, we unify a wide range of harmful jailbreak intents and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, in which a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves an attack success rate (ASR) up to 54.0 and 34.6 points higher than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, safety evaluations suggest that diverse attack categories uncover fine-grained LLM vulnerabilities, and that categories that appear benign under single-turn evaluation can exhibit substantially higher adversarial effectiveness in multi-turn scenarios. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety.
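The abstract does not say how uncertainty is measured in the refinement step. As one possible reading, the sketch below uses generic uncertainty sampling: each candidate gets a judge-estimated success probability, and the candidates with the highest binary entropy (those the judge is least sure about) are selected for another refinement round. The entropy criterion, the function names, the candidate IDs, and the scores are all assumptions for illustration, not the authors' pipeline.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome; peaks at p = 0.5."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_uncertain(scored, k):
    """Return the IDs of the k candidates with the most uncertain scores."""
    ranked = sorted(scored, key=lambda item: binary_entropy(item[1]),
                    reverse=True)
    return [candidate_id for candidate_id, _ in ranked[:k]]


# Hypothetical (candidate ID, judge success probability) pairs.
scored_candidates = [
    ("cand-1", 0.97),  # judge is confident: low uncertainty
    ("cand-2", 0.52),  # close to a coin flip: high uncertainty
    ("cand-3", 0.10),
    ("cand-4", 0.45),
]

to_refine = select_uncertain(scored_candidates, k=2)
print(to_refine)
```

Under this assumption, candidates with scores near 0.5 are prioritized because a clear verdict on them is the most informative for the next fine-tuning iteration.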
