Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
arXiv cs.LG / 5/1/2026
Key Points
- The paper investigates how dynamic adversarial fine-tuning (R2D2-style) reshapes the internal “refusal geometry” of a safety-aligned 7B language model during training.
- Using a measurement-driven protocol that combines HarmBench, StrongREJECT, and XSTest with a five-anchor refusal-geometry suite and causal interventions, the authors track changes in jailbreak/refusal behavior over training steps.
- Results show R2D2 can drive the HarmBench attack success rate (ASR) to zero at early-to-mid training (0.000 at steps 50 and 100), though ASR partially rebounds later (0.035 at step 250 and 0.250 at step 500), while standard SFT remains far less robust (ASR ≈ 0.505–0.588).
- On XSTest, R2D2 initially refuses essentially all prompts (any-refusal rate 1.000), but this rate declines substantially over training (to 0.664 and then 0.228), indicating evolving refusal characteristics rather than a static defense.
- The authors find that refusal “carriers” relocate from later-layer to earlier-layer representations during training while effective control rank stays roughly constant (~1.23–1.27), supporting a “reorganization” mechanism over a “drift-only” explanation.
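The "carrier relocation" and "effective rank" measurements above can be illustrated with a minimal sketch. This is not the paper's code: the synthetic activations, the difference-of-means refusal direction, and the participation-ratio rank estimate are all stand-in assumptions chosen for illustration.

```python
# Hypothetical sketch (not the paper's protocol): locate the layer that
# "carries" a refusal signal via difference-of-means, and estimate the
# effective rank of the per-layer refusal directions. Real residual-stream
# activations are replaced with synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_prompts, d = 8, 64, 32

# Synthetic stand-in for per-layer activations on two prompt sets.
harmless = rng.normal(size=(n_layers, n_prompts, d))
harmful = rng.normal(size=(n_layers, n_prompts, d))
signal = rng.normal(size=d)
harmful[5] += 3.0 * signal  # plant a refusal "carrier" at layer 5

def refusal_direction(h_harmful, h_harmless):
    """Difference-of-means direction between prompt sets, unit-normalized."""
    delta = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return delta / np.linalg.norm(delta)

def carrier_strength(h_harmful, h_harmless):
    """Norm of the mean difference: how strongly this layer separates the sets."""
    return np.linalg.norm(h_harmful.mean(axis=0) - h_harmless.mean(axis=0))

def effective_rank(directions):
    """Participation ratio of singular values: (sum s)^2 / sum s^2."""
    s = np.linalg.svd(directions, compute_uv=False)
    return (s.sum() ** 2) / (s ** 2).sum()

# The carrier layer is the one with the strongest harmful/harmless separation;
# tracking argmax over training steps would show relocation across layers.
strengths = [carrier_strength(harmful[l], harmless[l]) for l in range(n_layers)]
carrier_layer = int(np.argmax(strengths))

# Stack per-layer directions and estimate how many directions matter.
dirs = np.stack([refusal_direction(harmful[l], harmless[l])
                 for l in range(n_layers)])
print(carrier_layer, round(effective_rank(dirs), 2))
```

Re-running this measurement at successive fine-tuning checkpoints and watching `carrier_layer` move while `effective_rank` stays flat is the kind of evidence that distinguishes "reorganization" from mere drift.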