An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv cs.CL · April 22, 2026


Key Points

  • The paper studies jailbreak detection for large language models under realistic settings, using JailbreakBench Behaviors and generator models with different alignment strengths.
  • It compares two approaches—a lexical TF-IDF detector and a generation-inconsistency-based detector—across varying sampling budgets (how many outputs are sampled per prompt).
  • The authors find that evaluating only a single output per prompt systematically underestimates jailbreak vulnerability, because harmful behaviors appear more often when multiple generations are sampled.
  • Improvements are largest when moving from single-generation to moderate multi-sampling, while larger sampling budgets offer diminishing returns.
  • Cross-model experiments show that detection signals generalize only partially, with stronger transfer within related model families.
  • A category-level analysis indicates that lexical detectors also rely on topic-specific cues rather than solely on harmful behavior.
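The sampling-budget finding above follows directly from a simple probabilistic argument. A minimal sketch (illustrative only, not the paper's code): if a prompt elicits harmful output with per-sample probability p, the chance that at least one of k independent samples is harmful is 1 - (1 - p)^k, which rises steeply for small k and then flattens.

```python
# Why single-output evaluation underestimates vulnerability: with a small
# per-sample harm probability p, one sample usually looks safe, but the
# chance of observing at least one harmful output grows quickly with k.

def detection_prob(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is harmful."""
    return 1.0 - (1.0 - p) ** k

# A rarely-failing aligned model (p = 0.05, an assumed value for illustration):
# the marginal gain per extra sample shrinks as the budget grows.
for k in (1, 5, 10, 50):
    print(k, round(detection_prob(0.05, k), 3))
```

The per-sample marginal gain between k = 1 and k = 5 is several times larger than between k = 10 and k = 50, matching the "moderate sampling helps most, large budgets show diminishing returns" pattern the paper reports.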

Abstract

Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output-based jailbreak detection under realistic conditions, using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation-inconsistency-based detector across different sampling budgets. Our results show that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross-generator experiments demonstrate that detection signals partially generalise across models, with stronger transfer observed within related model families. A category-level analysis further reveals that lexical detectors capture a mixture of behavioural signals and topic-specific cues, rather than purely harmful behaviour. Overall, our findings suggest that moderate multi-sample auditing provides a more reliable and practical approach for estimating model vulnerability and improving jailbreak detection in large language models. Code will be released.
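The abstract does not specify how the generation-inconsistency detector is computed, so the following is a hedged sketch of one plausible signal, not the authors' method: sample several outputs for the same prompt and measure their mean pairwise lexical disagreement. A consistently refusing model yields near-identical outputs (low inconsistency), while a prompt that sometimes jailbreaks the model produces divergent outputs (high inconsistency). The `jaccard` and `inconsistency` helper names are assumptions introduced here for illustration.

```python
# Sketch of a generation-inconsistency signal over k sampled outputs.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two outputs."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def inconsistency(outputs: list[str]) -> float:
    """Mean pairwise lexical dissimilarity across sampled generations."""
    pairs = [(i, j) for i in range(len(outputs))
             for j in range(i + 1, len(outputs))]
    if not pairs:
        return 0.0
    return sum(1.0 - jaccard(outputs[i], outputs[j])
               for i, j in pairs) / len(pairs)

refusals = ["I cannot help with that request."] * 3
mixed = ["I cannot help with that request.",
         "Sure, here is a step-by-step plan...",
         "I cannot help with that request."]
print(inconsistency(refusals))  # identical refusals -> 0.0
print(inconsistency(mixed))     # divergent outputs -> > 0
```

Token-set Jaccard is deliberately crude; any pairwise similarity (embedding cosine, edit distance) would slot into the same structure, and the score could be thresholded to flag prompts for closer auditing.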