Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
arXiv cs.LG / 4/20/2026
Key Points
- The paper argues that even aligned LLMs can produce unsafe outputs because pretraining leaves behind “unsafe subnetworks” that existing methods (SFT/RLHF) do not explicitly eliminate.
- It proposes a resource-efficient, gradient-free pruning framework that identifies and removes parameters linked to unsafe behaviors while preserving overall model utility.
- The approach is designed to be lightweight—using only modest GPU resources—and is reported to generalize across architectures and quantized model variants.
- Experiments indicate large reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal loss of utility.
- Interpreting results via the Lottery Ticket Hypothesis, the authors claim pruning can remove “unsafe tickets” and expose “safety tickets,” enabling a post-hoc alignment method for deployment in constrained environments.
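The paper itself is not reproduced here, but the core idea — score parameters by their association with unsafe behavior without backpropagation, then mask the top-scoring ones — can be sketched in a few lines. The following NumPy example uses a hypothetical activation-gap proxy (mean input activation on unsafe prompts minus mean on safe prompts, times weight magnitude) as the gradient-free saliency; the function names and the scoring rule are illustrative assumptions, not the authors' actual criterion.

```python
import numpy as np

def unsafe_importance(weights, act_unsafe, act_safe):
    """Gradient-free saliency sketch (hypothetical proxy, not the paper's rule).

    weights:    (out, in) weight matrix of one linear layer
    act_unsafe: (n_unsafe_prompts, in) input activations on unsafe prompts
    act_safe:   (n_safe_prompts, in) input activations on benign prompts
    """
    # How much more each input unit fires on unsafe prompts than on safe ones.
    gap = np.clip(act_unsafe.mean(axis=0) - act_safe.mean(axis=0), 0.0, None)
    # Weight magnitude scaled by the unsafe-activation gap of its input unit.
    return np.abs(weights) * gap[np.newaxis, :]

def prune_unsafe(weights, act_unsafe, act_safe, ratio=0.05):
    """Zero out the `ratio` fraction of weights with the highest unsafe score."""
    score = unsafe_importance(weights, act_unsafe, act_safe)
    k = int(score.size * ratio)
    if k == 0:
        return weights.copy()
    # k-th largest score; everything at or above it gets pruned.
    thresh = np.partition(score.ravel(), -k)[-k]
    keep_mask = score < thresh
    return weights * keep_mask
```

Because the score needs only forward-pass activation statistics, a loop over layers like this stays cheap on modest GPUs, which matches the paper's resource-efficiency claim; the real method may aggregate statistics and select subnetworks quite differently.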