Self-Mined Hardness for Safety Fine-Tuning
arXiv cs.LG / 5/6/2026
Key Points
- The paper proposes a safety fine-tuning method that scores each candidate prompt by how frequently the target model’s own rollouts are judged harmful, then trains on the hardest prompts paired with the model’s own non-jailbroken outputs (see the mining sketch after this list).
- On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this “self-mined hardness” approach substantially reduces WildJailbreak attack success rates (from 11.5% and 20.1% to 1–3%), while initially increasing refusals on jailbreak-shaped benign prompts (from 14–22% to 74–94%).
- To improve the benign-refusal tradeoff, the authors interleave the hardest jailbreak-shaped prompts with adversarially framed benign prompts (see the mixing sketch after this list), cutting the refusal rate to 30–51% on the 8B model and 52–72% on the 3B model at the cost of a modest rise in attack success rate (2–6 percentage points).
- Within the mixed training regime, selecting the hardest half of the eligible prompt pool, rather than sampling from it at random, cuts the remaining attack success rate by a further 35–50% relative (about 3 percentage points) on both models.
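
The mining loop described in the first bullet can be sketched compactly. The sketch below is a minimal illustration, not the paper's code: `sample_completion` (the target model's sampler), `is_harmful` (the harmfulness judge), `n_rollouts`, and `keep_frac` are all hypothetical names and default values chosen here for clarity. `keep_frac=0.5` mirrors the "hardest half" selection from the last bullet.

```python
import random

def hardness(prompt, sample_completion, is_harmful, n_rollouts=16):
    """Estimate a prompt's hardness as the fraction of the model's own
    rollouts that the judge flags as harmful. Also returns the safe
    rollouts so one can later serve as the fine-tuning target.
    (sample_completion / is_harmful are placeholders, not the paper's API.)"""
    rollouts = [sample_completion(prompt) for _ in range(n_rollouts)]
    labels = [is_harmful(prompt, r) for r in rollouts]
    score = sum(labels) / n_rollouts
    safe = [r for r, bad in zip(rollouts, labels) if not bad]
    return score, safe

def mine_hard_pairs(candidates, sample_completion, is_harmful, keep_frac=0.5):
    """Score every candidate prompt, keep the hardest fraction, and pair
    each kept prompt with one of its own non-jailbroken outputs."""
    scored = []
    for prompt in candidates:
        score, safe = hardness(prompt, sample_completion, is_harmful)
        if safe:  # a prompt is only usable if at least one rollout stayed safe
            scored.append((score, prompt, random.choice(safe)))
    scored.sort(key=lambda t: t[0], reverse=True)  # hardest first
    kept = scored[: int(len(scored) * keep_frac)]
    return [(prompt, target) for _, prompt, target in kept]
```

Because the training targets are the model's own safe rollouts, the fine-tuning data stays on-policy: the model is reinforced toward behavior it already exhibits some of the time on exactly the prompts where it fails most often.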
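
The mixing step from the third bullet then amounts to interleaving the mined pairs with benign pairs. Again a sketch under assumed names: `benign_pairs` is taken here to hold (adversarially framed benign prompt, compliant answer) tuples, which the paper describes but does not name this way.

```python
import random

def build_mixed_dataset(hard_pairs, benign_pairs, seed=0):
    """Interleave hard jailbreak-shaped (prompt, safe-output) pairs with
    adversarially framed benign (prompt, compliant-answer) pairs, so the
    model sees both refusal and compliance targets under similar
    surface forms during fine-tuning."""
    rng = random.Random(seed)
    mixed = list(hard_pairs) + list(benign_pairs)
    rng.shuffle(mixed)  # shuffling interleaves the two sources
    return mixed
```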