Self-Mined Hardness for Safety Fine-Tuning

arXiv cs.LG / 5/6/2026


Key Points

  • The paper proposes a safety fine-tuning method that scores candidate prompts by how frequently the target model’s own rollouts are judged harmful, then trains on the hardest prompts with the model’s corresponding non-jailbroken outputs.
  • On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this “self-mined hardness” approach sharply reduces WildJailbreak attack success rates (from 11.5% and 20.1% to 1–3%), but, before any mitigation, it also drives refusals on jailbreak-shaped benign prompts from 14–22% up to 74–94%.
  • To improve the tradeoff between robustness and benign refusals, the authors interleave the hardest jailbreak-shaped prompts 1:1 with adversarially framed benign prompts, reducing the refusal rate to 30–51% on 8B and 52–72% on 3B, at the cost of a 2–6 percentage-point increase in attack success rate.
  • Within the mixed training regime, selecting the hardest half of the eligible prompt pool instead of sampling randomly further lowers the remaining attack success rate by 35–50% (about 3 percentage points) on both models.
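The mining step described in the key points can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `generate` and `judge_harmful` are hypothetical stand-ins for the target model's sampler and the harmfulness judge, and the rollout count and eligibility rule are assumptions.

```python
import random

def hardness(prompt, generate, judge_harmful, n_rollouts=16):
    """Score a prompt by the fraction of the model's own rollouts judged harmful."""
    rollouts = [generate(prompt) for _ in range(n_rollouts)]
    harmful = [judge_harmful(r) for r in rollouts]
    return sum(harmful) / n_rollouts, rollouts, harmful

def mine_hardest(prompts, generate, judge_harmful, keep_frac=0.5):
    """Keep the hardest `keep_frac` of eligible prompts, each paired with one
    of the model's own non-jailbroken rollouts as the training target."""
    scored = []
    for p in prompts:
        score, rollouts, harmful = hardness(p, generate, judge_harmful)
        safe = [r for r, h in zip(rollouts, harmful) if not h]
        if safe:  # a prompt is eligible only if some rollout was non-jailbroken
            scored.append((score, p, random.choice(safe)))
    scored.sort(key=lambda t: t[0], reverse=True)  # hardest first
    k = int(len(scored) * keep_frac)
    return [(p, r) for _, p, r in scored[:k]]
```

Selecting by this score, rather than sampling prompts uniformly, is exactly the "hardest half vs. random half" comparison in the last bullet above.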

Abstract

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1–3%, but pushes refusal on jailbreak-shaped benign prompts from 14–22% to 74–94%. Interleaving the same hard prompts 1:1 with adversarially framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30–51% on 8B and 52–72% on 3B, at a cost of 2–6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35–50% (about 3 percentage points) on both models.
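The 1:1 interleaving in the abstract amounts to alternating the two pools of (prompt, target) pairs when building the fine-tuning set. The sketch below is a hypothetical construction under assumed names; the dict schema is illustrative and not the paper's actual data format.

```python
def interleave_1to1(hard_pairs, benign_pairs):
    """Alternate hard jailbreak pairs (prompt + the model's own safe rollout)
    with adversarially framed benign pairs (prompt + helpful answer),
    truncating to the shorter pool so the mix stays exactly 1:1."""
    mixed = []
    for (jp, jt), (bp, bt) in zip(hard_pairs, benign_pairs):
        mixed.append({"prompt": jp, "target": jt, "kind": "hard_jailbreak"})
        mixed.append({"prompt": bp, "target": bt, "kind": "adversarial_benign"})
    return mixed
```

Training on the benign half teaches the model that jailbreak-shaped surface form alone is not grounds for refusal, which is what pulls the benign-refusal rate back down.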