ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
arXiv cs.AI / 3/12/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- ADVERSA is an automated red-teaming framework that measures guardrail degradation as continuous per-round trajectories rather than single jailbreak events.
- It uses a fine-tuned attacker model (ADVERSA-Red), trained to suppress attacker-side safety refusals, and scores victim responses on a 5-point rubric that treats partial compliance as a distinct state (a trajectory-and-rubric sketch follows this list).
- In experiments on Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.2, run over 15 conversations of up to 10 rounds each, jailbreaks occurred in 26.7% of cases with an average of 1.25 jailbreak rounds per conversation, suggesting early-round vulnerabilities (a metrics sketch follows this list).
- The study uses a triple-judge consensus to quantify judge reliability, and it reports inter-judge agreement, self-judge tendencies, attacker drift, and refusals as confounds in measuring victim resistance (a consensus-vote sketch follows this list).
- The authors acknowledge limitations, disclose that attack prompts are withheld, and release experimental artifacts under a responsible-disclosure policy.
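
The following is a minimal Python sketch of the per-round trajectory and rubric framing described above. All names here (`RubricScore`, the five level labels, `ConversationTrajectory`) are hypothetical illustrations; the summary only states that the rubric has five points and that partial compliance is a distinct state.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class RubricScore(IntEnum):
    """Hypothetical 5-point rubric; the paper's actual labels are not given here."""
    FULL_REFUSAL = 1
    DEFLECTION = 2
    PARTIAL_COMPLIANCE = 3   # treated as a distinct state, per the summary
    SUBSTANTIAL_COMPLIANCE = 4
    FULL_JAILBREAK = 5


@dataclass
class ConversationTrajectory:
    """Per-round scores for one attacker-victim conversation (up to 10 rounds)."""
    scores: list[RubricScore] = field(default_factory=list)

    def record(self, score: RubricScore) -> None:
        self.scores.append(score)

    def jailbreak_rounds(self) -> int:
        """Number of rounds scored as a full jailbreak."""
        return sum(1 for s in self.scores if s is RubricScore.FULL_JAILBREAK)

    def was_jailbroken(self) -> bool:
        return self.jailbreak_rounds() > 0
```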
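
Building on that sketch, the headline numbers could be aggregated roughly as follows. This assumes "26.7% of cases" means 4 of 15 conversations (4/15 ≈ 26.7%) and that the 1.25 figure is averaged over jailbroken conversations; the paper's exact definitions may differ.

```python
def summarize(trajectories: list[ConversationTrajectory]) -> dict[str, float]:
    """Aggregate per-conversation trajectories into headline statistics."""
    jailbroken = [t for t in trajectories if t.was_jailbroken()]
    return {
        # fraction of conversations with at least one jailbroken round
        "jailbreak_rate": len(jailbroken) / len(trajectories),
        # mean jailbroken rounds, averaged over jailbroken conversations
        "mean_jailbreak_rounds": (
            sum(t.jailbreak_rounds() for t in jailbroken) / len(jailbroken)
            if jailbroken
            else 0.0
        ),
    }

# Consistency check under the stated assumptions: 4 of 15 conversations
# jailbroken gives a rate of 4/15 ≈ 0.267, and 5 jailbroken rounds spread
# over those 4 conversations gives a mean of 5/4 = 1.25.
```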
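
Finally, a sketch of what the triple-judge consensus might look like, assuming a simple majority vote with a median fallback; the paper's actual aggregation and tie-breaking rules are not given in this summary.

```python
from collections import Counter


def consensus(judge_scores: list[RubricScore]) -> tuple[RubricScore, bool]:
    """Return (consensus score, unanimous flag) for one round's three verdicts.

    Majority vote wins; the boolean is True only when all three judges agree.
    If all three disagree, fall back to the median score for the round.
    """
    assert len(judge_scores) == 3, "triple-judge consensus expects 3 verdicts"
    (top, count), = Counter(judge_scores).most_common(1)
    if count >= 2:
        return top, count == 3
    return sorted(judge_scores)[1], False
```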