Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
arXiv cs.LG / 4/24/2026
Key Points
- The paper proposes an interpretability-driven safety auditing method for large language models that aims to expose vulnerabilities tied to internal model representations rather than relying only on black-box probing.
- It performs a two-stage, adaptive grid search over activation-steering coefficients, using two interpretability techniques (Universal Steering (US) and Representation Engineering (RepE)) to inject unsafe behavioral concepts into the model's internal representations and elicit jailbreaks (a minimal sketch follows this list).
- Across eight state-of-the-art open-source LLMs, results vary sharply by model: Llama-3 models are highly vulnerable (up to 91% jailbreak success with US and 83% with RepE on the 4-bit Llama-3.3-70B), while GPT-oss-120B remains robust against both approaches.
- Smaller Qwen3 and Phi variants generally show lower jailbreak rates, whereas larger versions of those families tend to be more susceptible, indicating size-dependent differences in robustness.
- The study argues that interpretability-based steering is effective for systematic safety audits, but it also raises dual-use concerns and underscores the need for stronger internal defenses in LLM deployment.
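For readers who want a concrete picture of what coefficient-based activation steering and a two-stage coefficient search could look like, here is a minimal, hedged sketch. It is not the paper's implementation: the model id, the steered layer, the steering direction, and the `is_jailbroken` scorer are all placeholder assumptions; in the paper the direction would come from the US or RepE pipeline and success would be judged by a proper safety evaluator.

```python
# Minimal sketch of coefficient-based activation steering plus a coarse-to-fine
# coefficient search. All specifics here (model id, layer index, how the
# steering direction and the jailbreak scorer are obtained) are illustrative
# assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder model
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

LAYER_IDX = 20  # decoder layer to steer (assumption)
# Placeholder direction; in practice this would be a concept vector extracted
# with a RepE-style contrastive procedure, not random noise.
steering_vector = torch.randn(model.config.hidden_size)

def make_hook(coeff):
    """Return a forward hook that adds coeff * steering_vector to the layer output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

def generate_with_steering(prompt, coeff, max_new_tokens=128):
    """Generate a completion while the chosen layer is steered by coeff."""
    handle = model.model.layers[LAYER_IDX].register_forward_hook(make_hook(coeff))
    try:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
        return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
    finally:
        handle.remove()  # always detach the hook, even if generation fails

def is_jailbroken(text):
    """Crude stand-in for a real safety judge: treat non-refusals as success."""
    refusals = ("I can't", "I cannot", "I won't", "I'm sorry")
    return not any(r in text for r in refusals)

def two_stage_search(prompts, coarse=(2.0, 4.0, 8.0, 16.0), n_fine=5):
    """Stage 1: coarse grid over coefficients. Stage 2: refine around the best one."""
    def score(c):
        return sum(is_jailbroken(generate_with_steering(p, c)) for p in prompts)
    results = {c: score(c) for c in coarse}
    best = max(results, key=results.get)
    fine = [best * (0.5 + 0.25 * i) for i in range(n_fine)]  # sweep ~0.5x to 1.5x of best
    results.update({c: score(c) for c in fine if c not in results})
    best = max(results, key=results.get)
    return best, results
```

The two-stage structure mirrors the adaptive idea described above: a cheap coarse sweep locates a promising coefficient range, then a finer sweep around the best coarse value trades a few extra generations for a tighter estimate of the most effective steering strength.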