Adaptive Instruction Composition for Automated LLM Red-Teaming

arXiv cs.CL / 4/24/2026

💬 Opinion · Models & Research

Key Points

  • The paper proposes an Adaptive Instruction Composition framework for LLM red-teaming that improves on random or trial-and-error instruction generation by jointly optimizing effectiveness and diversity.
  • It uses reinforcement learning to balance exploration and exploitation while searching a large combinatorial space of red-team instructions aimed at target vulnerabilities.
  • Experiments show the method significantly outperforms random instruction combination on both effectiveness and diversity metrics, including when attacks are tested under model transfer.
  • The approach also beats multiple recent adaptive baselines on the HarmBench benchmark, using a lightweight neural contextual bandit with contrastive embedding inputs.
  • Ablation studies indicate that contrastive pretraining helps the bandit generalize quickly and scale to much larger instruction spaces as it learns.
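The selection loop the bullets describe can be sketched as a small contextual bandit over the combinatorial space of (query, tactic) instructions. Everything below is an illustrative assumption, not the paper's code: the class name, the epsilon-greedy rule standing in for the paper's reinforcement-learning exploration/exploitation balance, a linear scorer standing in for the lightweight neural network, and random vectors standing in for contrastive embeddings.

```python
import numpy as np

class InstructionBandit:
    """Toy linear contextual bandit over (harmful query, tactic) pairs.

    A linear scorer stands in for the paper's lightweight neural network,
    and fixed random vectors stand in for contrastive embeddings.
    """

    def __init__(self, query_embs, tactic_embs, eps=0.3, lr=0.05, seed=0):
        self.q, self.t = query_embs, tactic_embs
        self.eps, self.lr = eps, lr
        self.rng = np.random.default_rng(seed)
        self.w = np.zeros(query_embs.shape[1] + tactic_embs.shape[1])

    def _feat(self, qi, ti):
        # An instruction combines one query with one tactic; its context
        # vector is simply the two embeddings concatenated.
        return np.concatenate([self.q[qi], self.t[ti]])

    def select(self):
        # Epsilon-greedy: explore a random combination, else exploit the
        # highest-scoring pair in the combinatorial space.
        if self.rng.random() < self.eps:
            return (int(self.rng.integers(len(self.q))),
                    int(self.rng.integers(len(self.t))))
        scores = np.array([[self.w @ self._feat(i, j)
                            for j in range(len(self.t))]
                           for i in range(len(self.q))])
        qi, ti = np.unravel_index(np.argmax(scores), scores.shape)
        return int(qi), int(ti)

    def update(self, qi, ti, reward):
        # LMS step pulling the predicted score toward the observed reward
        # (e.g. whether the attacker's generation jailbroke the target).
        x = self._feat(qi, ti)
        self.w += self.lr * (reward - self.w @ x) * x
```

In use, `reward` would come from a judge scoring the attacker's output against the target model; here it is whatever scalar the caller supplies.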

Abstract

Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on HarmBench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive instruction space as it learns.
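One plausible way to read "jointly optimize effectiveness with diversity" is a reward that credits attack success but discounts similarity to earlier successful attacks. The function below is a hedged sketch of that idea only; the name `combined_reward`, the cosine-similarity penalty, and the weight `lam` are assumptions, not the paper's stated objective.

```python
import numpy as np

def combined_reward(success, new_emb, past_embs, lam=0.5):
    """Effectiveness score minus a penalty for resembling past attacks.

    success   : scalar in [0, 1], e.g. a judge's jailbreak verdict (assumed)
    new_emb   : embedding of the new attack
    past_embs : embeddings of previously successful attacks
    lam       : assumed diversity weight
    """
    if len(past_embs) == 0:
        return float(success)
    new_unit = new_emb / np.linalg.norm(new_emb)
    sims = [new_unit @ (p / np.linalg.norm(p)) for p in past_embs]
    # Penalize only the closest prior attack, so a single near-duplicate
    # is enough to shrink the reward.
    return float(success) - lam * max(sims)
```

A reward shaped this way would push the bandit away from repeatedly exploiting one semantic cluster of jailbreaks, matching the abstract's contrast with trial-and-error attackers whose successes are semantically limited.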