Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
arXiv cs.CL / 3/17/2026
Key Points
- The paper demonstrates that safety tuning can still miss rare unsafe behaviors, leaving long-tail risks in LLM outputs.
- It introduces Progressive Diverse Population Sampling (PDPS), a method that combines stochastic token sampling with diversity-aware selection to generate a large pool of candidate responses and retain a compact, diverse subset.
- PDPS achieves jailbreak success rates comparable to large-scale IID sampling at only 8% to 29% of the computational cost. When the number of responses is limited, it improves success rates by 26% to 40% over both IID sampling and Diverse Beam Search.
- Across multiple jailbreak benchmarks and open-source LLMs, PDPS yields more diverse unsafe outputs, broadening the range of detectable failures.
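
The summary above describes PDPS only at a high level: sample many candidates stochastically, then keep a compact, diverse subset. The following is a minimal sketch of that general idea, not the paper's algorithm. It assumes a hypothetical `sample_fn` that returns one sampled response string, uses bag-of-words cosine distance as a stand-in for whatever diversity measure the authors use, and interprets "progressive" as pruning the population back to `k` items after each sampling round. All function names here are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Toy text representation (assumption): lowercase token counts.
    return Counter(text.lower().split())


def cosine_distance(a, b):
    # 1 - cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse(responses, k):
    # Greedy farthest-point selection: start from the first response,
    # then repeatedly add the candidate farthest from everything chosen.
    if k <= 0 or not responses:
        return []
    vecs = [bag_of_words(r) for r in responses]
    chosen = [0]
    min_dist = [cosine_distance(vecs[0], v) for v in vecs]
    while len(chosen) < min(k, len(responses)):
        candidates = [i for i in range(len(responses)) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        # Track each response's distance to its nearest chosen item.
        min_dist = [
            min(d, cosine_distance(vecs[nxt], v))
            for d, v in zip(min_dist, vecs)
        ]
    return [responses[i] for i in chosen]


def progressive_sample(sample_fn, rounds, batch_size, k):
    # Each round: draw a fresh batch, merge it with the retained
    # population, and prune back to the k most mutually distant items.
    population = []
    for _ in range(rounds):
        batch = [sample_fn() for _ in range(batch_size)]
        population = select_diverse(population + batch, k)
    return population
```

In practice `sample_fn` would wrap a stochastic decoding call to the model under test, and the retained population would be passed to a safety judge. Both are outside the scope of this sketch.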