ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
arXiv cs.AI / 3/12/2026
💬 OpinionIdeas & Deep AnalysisModels & Research
Key Points
- ADVERSA is an automated red-teaming framework that measures guardrail degradation as continuous per-round trajectories rather than single jailbreak events.
- It uses a fine-tuned attacker model (ADVERSA-Red) to remove attacker-side safety refusals and scores victim responses on a 5-point rubric that treats partial compliance as a distinct state.
- In experiments across Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.2 with 15 conversations of up to 10 rounds, jailbreaks occurred in 26.7% of cases, averaging 1.25 jailbreak rounds per conversation, suggesting early-round vulnerabilities.
- The study uses a triple-judge consensus to quantify judge reliability and reports on inter-judge agreement, self-judge tendencies, attacker drift, and refusals as confounds in measuring victim resistance.
- The authors acknowledge limitations, disclose that attack prompts are withheld, and release experimental artifacts under a responsible-disclosure policy.
Related Articles
Santa Augmentcode Intent Ep.6
Dev.to

Your Agent Hired Another Agent. The Output Was Garbage. The Money's Gone.
Dev.to
Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.
Dev.to

Palantir’s billionaire CEO says only two kinds of people will succeed in the AI era: trade workers — ‘or you’re neurodivergent’
Reddit r/artificial
Scaffolded Test-First Prompting: Get Correct Code From the First Run
Dev.to