Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

arXiv cs.LG / April 24, 2026


Key Points

  • The paper proposes an interpretability-driven safety auditing method for large language models that aims to expose vulnerabilities tied to internal model representations rather than relying only on black-box probing.
  • It performs an adaptive, two-stage grid search built on two interpretability techniques, Universal Steering (US) and Representation Engineering (RepE), to find activation-steering coefficients that elicit unsafe behavioral concepts and thereby jailbreak the model (a minimal sketch of the steering mechanism follows this list).
  • Across eight state-of-the-art open-source LLMs, results vary sharply by model: Llama-3 models are highly vulnerable (up to 91% jailbreak success with US and 83% with RepE on Llama-3.3-70B-4bt), while GPT-oss-120B shows strong robustness against both approaches.
  • Smaller Qwen3 and Phi variants generally show lower jailbreak rates, whereas larger versions of those families tend to be more susceptible, indicating size-dependent differences in robustness.
  • The study argues that interpretability-based steering is effective for systematic safety audits, but it also raises dual-use concerns and underscores the need for stronger internal defenses in LLM deployment.
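
To make the attack surface concrete, here is a minimal, hypothetical sketch of activation steering via a PyTorch forward hook: a scaled concept vector is added to one layer's residual stream during generation. The model (gpt2), layer index, random steering vector, and coefficient are all illustrative stand-ins; the paper derives its vectors with US and RepE and targets the eight audited models.

```python
# Hypothetical sketch of activation steering: add a scaled "concept" vector to the
# residual stream of one transformer layer via a PyTorch forward hook. The layer
# index, coefficient, and steering vector below are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper audits much larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                    # which block to steer (assumption)
hidden = model.config.hidden_size
steer_vec = torch.randn(hidden)  # placeholder; US/RepE derive this from model data
steer_vec = steer_vec / steer_vec.norm()
coeff = 4.0                      # steering coefficient: the quantity the grid search tunes

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    h = output[0] + coeff * steer_vec.to(output[0].dtype)
    return (h,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
ids = tok("How do I", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0]))
handle.remove()  # detach the hook once the steered generation is done
```

The coefficient `coeff` is the knob the audit optimizes: too small and the injected concept typically has no behavioral effect, too large and generations tend to degrade into incoherence, which is what motivates a search over coefficient values.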

Abstract

Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches -- Universal Steering (US) and Representation Engineering (RepE) -- we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
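
The abstract's adaptive two-stage grid search is not specified in detail here; the sketch below shows one plausible coarse-then-fine realization. The `jailbreak_rate` function is a hypothetical stand-in for the paper's steer-generate-judge loop, and the grid bounds and sizes are arbitrary assumptions.

```python
# A minimal sketch of a two-stage (coarse-then-fine) adaptive grid search over a
# steering coefficient. `jailbreak_rate` is a hypothetical stand-in for the paper's
# LLM-judge scoring loop; the grids and refinement width are assumptions.
import numpy as np

def jailbreak_rate(coeff: float) -> float:
    """Placeholder: fraction of harmful queries judged jailbroken at this coefficient.
    In the audit this would steer the model, generate responses, and ask an LLM judge."""
    return float(np.exp(-((coeff - 5.0) ** 2) / 8.0))  # toy unimodal response surface

def two_stage_grid_search(lo=0.0, hi=12.0, coarse_n=7, fine_n=9):
    # Stage 1: coarse sweep over the full coefficient range
    coarse = np.linspace(lo, hi, coarse_n)
    best = coarse[int(np.argmax([jailbreak_rate(c) for c in coarse]))]
    # Stage 2: refine around the coarse optimum with a narrower, denser grid
    step = (hi - lo) / (coarse_n - 1)
    fine = np.linspace(max(lo, best - step), min(hi, best + step), fine_n)
    scores = [jailbreak_rate(c) for c in fine]
    return fine[int(np.argmax(scores))], max(scores)

coeff, rate = two_stage_grid_search()
print(f"best coefficient ~ {coeff:.2f}, jailbreak rate ~ {rate:.2f}")
```

A coarse pass keeps the number of expensive generate-and-judge evaluations low across the full coefficient range, while the fine pass recovers precision only where it matters, which is presumably why a two-stage scheme is preferred over a single dense grid.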