Self-Mined Hardness for Safety Fine-Tuning

arXiv cs.LG / 5/6/2026


Key Points

  • The paper proposes a safety fine-tuning method that scores candidate prompts by how frequently the target model’s own rollouts are judged harmful, then trains on the hardest prompts with the model’s corresponding non-jailbroken outputs.
  • On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this “self-mined hardness” approach sharply reduces WildJailbreak attack success rates (from 11.5% and 20.1% to 1–3%), but, before any mitigation, it also drives refusals on jailbreak-shaped benign prompts from 14–22% up to 74–94%.
  • To improve the tradeoff between robustness and benign refusals, the authors interleave the hardest jailbreak-shaped prompts 1:1 with adversarially framed benign prompts, reducing the refusal rate to 30–51% on 8B and 52–72% on 3B, at the cost of a 2–6 percentage-point increase in attack success rate.
  • Within the mixed training regime, selecting the hardest half of the eligible prompt pool instead of sampling randomly further lowers the remaining attack success rate by 35–50% (about 3 percentage points) on both models.
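The mining step described in the key points can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `generate` and `judge_harmful` are hypothetical stand-ins for the target model's sampler and the harmfulness judge, and the rollout count and eligibility rule are assumptions.

```python
import random

def hardness(prompt, generate, judge_harmful, n_rollouts=16):
    """Score a prompt by the fraction of the model's own rollouts judged harmful."""
    rollouts = [generate(prompt) for _ in range(n_rollouts)]
    harmful = [judge_harmful(r) for r in rollouts]
    return sum(harmful) / n_rollouts, rollouts, harmful

def mine_hardest(prompts, generate, judge_harmful, keep_frac=0.5):
    """Keep the hardest `keep_frac` of eligible prompts, each paired with one
    of the model's own non-jailbroken rollouts as the training target."""
    scored = []
    for p in prompts:
        score, rollouts, harmful = hardness(p, generate, judge_harmful)
        safe = [r for r, h in zip(rollouts, harmful) if not h]
        if safe:  # a prompt is eligible only if some rollout was non-jailbroken
            scored.append((score, p, random.choice(safe)))
    scored.sort(key=lambda t: t[0], reverse=True)  # hardest first
    k = int(len(scored) * keep_frac)
    return [(p, r) for _, p, r in scored[:k]]
```

Selecting by this score, rather than sampling prompts uniformly, is exactly the "hardest half vs. random half" comparison in the last bullet above.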

Abstract

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1–3%, but pushes refusal on jailbreak-shaped benign prompts from 14–22% to 74–94%. Interleaving the same hard prompts 1:1 with adversarially framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30–51% on 8B and 52–72% on 3B, at a cost of 2–6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35–50% (about 3 percentage points) on both models.
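The 1:1 interleaving in the abstract amounts to alternating the two pools of (prompt, target) pairs when building the fine-tuning set. The sketch below is a hypothetical construction under assumed names; the dict schema is illustrative and not the paper's actual data format.

```python
def interleave_1to1(hard_pairs, benign_pairs):
    """Alternate hard jailbreak pairs (prompt + the model's own safe rollout)
    with adversarially framed benign pairs (prompt + helpful answer),
    truncating to the shorter pool so the mix stays exactly 1:1."""
    mixed = []
    for (jp, jt), (bp, bt) in zip(hard_pairs, benign_pairs):
        mixed.append({"prompt": jp, "target": jt, "kind": "hard_jailbreak"})
        mixed.append({"prompt": bp, "target": bt, "kind": "adversarial_benign"})
    return mixed
```

Training on the benign half teaches the model that jailbreak-shaped surface form alone is not grounds for refusal, which is what pulls the benign-refusal rate back down.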