AI Navigate

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

arXiv cs.AI / 3/12/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • ADVERSA is an automated red-teaming framework that measures guardrail degradation as continuous per-round trajectories rather than single jailbreak events.
  • It uses a fine-tuned attacker model (ADVERSA-Red) to remove attacker-side safety refusals and scores victim responses on a 5-point rubric that treats partial compliance as a distinct state.
  • In experiments across Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.2 with 15 conversations of up to 10 rounds, jailbreaks occurred in 26.7% of conversations, at an average jailbreak round of 1.25, suggesting that successful attacks concentrate in early rounds.
  • The study uses a triple-judge consensus to quantify judge reliability and reports on inter-judge agreement, self-judge tendencies, attacker drift, and refusals as confounds in measuring victim resistance.
  • The authors acknowledge limitations, disclose that attack prompts are withheld, and release experimental artifacts under a responsible-disclosure policy.
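The trajectory-based aggregation described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's released code: the rubric scale, the assumption that the top rubric level counts as a jailbreak, and all function names are mine; the toy data is chosen only so the aggregates match the reported 26.7% rate and 1.25 average jailbreak round.

```python
# Hypothetical sketch of ADVERSA-style trajectory aggregation.
# Each conversation is a list of per-round rubric scores on a 1-5 scale
# (assumed: 5 = full compliance/jailbreak, 3 = partial compliance).

JAILBREAK_SCORE = 5  # assumption: only the top rubric level is a jailbreak

def trajectory_stats(conversations):
    """Return (jailbreak rate, mean round of first jailbreak)."""
    first_rounds = []  # 1-indexed round of first jailbreak per conversation
    for scores in conversations:
        for rnd, score in enumerate(scores, start=1):
            if score >= JAILBREAK_SCORE:
                first_rounds.append(rnd)
                break
    rate = len(first_rounds) / len(conversations)
    avg_round = sum(first_rounds) / len(first_rounds) if first_rounds else None
    return rate, avg_round

# Toy data: 15 conversations, 4 jailbreaks (26.7%), mostly in round 1,
# so the average first-jailbreak round is (1 + 2 + 1 + 1) / 4 = 1.25.
convs = [[5], [1, 5], [5], [5]] + [[1, 2, 1]] * 11
rate, avg = trajectory_stats(convs)
print(f"jailbreak rate = {rate:.1%}, avg jailbreak round = {avg}")
# → jailbreak rate = 26.7%, avg jailbreak round = 1.25
```

Measuring the whole per-round score list, rather than a single pass/fail bit, is what lets the framework distinguish early-round collapse from gradual degradation under sustained pressure.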

Abstract

Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, an approach that fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine-tuned 70B attacker model (ADVERSA-Red, Llama-3.1-70B-Instruct with QLoRA) that eliminates the attacker-side safety refusals that render off-the-shelf models unreliable as attackers, scoring victim responses on a structured 5-point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) using a triple-judge consensus architecture in which judge reliability is measured as a first-class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter-judge agreement rates, self-judge scoring tendencies, attacker drift as a failure mode in fine-tuned attackers deployed out of their training distribution, and attacker refusals as a previously underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.
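A triple-judge consensus of the kind the abstract describes can be sketched as follows. The consensus rule shown (median of three scores, which is robust to one outlier judge) and the exact-match pairwise agreement metric are illustrative assumptions on my part; the paper's actual consensus protocol and agreement statistics may differ.

```python
# Minimal sketch of a triple-judge consensus over 1-5 rubric scores.
# Assumption: each of three judges independently scores every response.
from statistics import median

def consensus_score(judge_scores):
    """Median of three judge scores; one disagreeing judge cannot flip it."""
    assert len(judge_scores) == 3
    return median(judge_scores)

def pairwise_agreement(all_scores):
    """Fraction of judge pairs, across all responses, that agree exactly."""
    agree = total = 0
    for a, b, c in all_scores:
        for x, y in ((a, b), (a, c), (b, c)):
            agree += (x == y)
            total += 1
    return agree / total

# Toy scores for three responses from three judges.
scores = [(5, 5, 4), (1, 1, 1), (3, 2, 3)]
print([consensus_score(s) for s in scores])  # → [5, 1, 3]
print(round(pairwise_agreement(scores), 3))  # 5 of 9 pairs agree
```

Reporting the agreement statistic alongside the consensus labels is what makes judge reliability a measured outcome rather than an assumption, as the abstract emphasizes.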