Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

arXiv cs.LG / April 24, 2026


Key Points

  • The paper proposes an interpretability-driven safety auditing method for large language models that aims to expose vulnerabilities tied to internal model representations rather than relying only on black-box probing.
  • It performs an adaptive, two-stage grid search built on two interpretability techniques, Universal Steering (US) and Representation Engineering (RepE), to find activation-steering coefficients that elicit unsafe behavioral concepts and thereby jailbreak the model (a minimal sketch of the steering mechanism follows this list).
  • Across eight state-of-the-art open-source LLMs, results vary sharply by model: Llama-3 models are highly vulnerable (up to 91% jailbreak success with US and 83% with RepE on Llama-3.3-70B-4bt), while GPT-oss-120B shows strong robustness against both approaches.
  • Smaller Qwen3 and Phi variants generally show lower jailbreak rates, whereas larger versions of those families tend to be more susceptible, indicating size-dependent differences in robustness.
  • The study argues that interpretability-based steering is effective for systematic safety audits, but it also raises dual-use concerns and underscores the need for stronger internal defenses in LLM deployment.
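
To make the attack surface concrete, here is a minimal, hypothetical sketch of activation steering via a PyTorch forward hook: a scaled concept vector is added to one layer's residual stream during generation. The model (gpt2), layer index, random steering vector, and coefficient are all illustrative stand-ins; the paper derives its vectors with US and RepE and targets the eight audited models.

```python
# Hypothetical sketch of activation steering: add a scaled "concept" vector to the
# residual stream of one transformer layer via a PyTorch forward hook. The layer
# index, coefficient, and steering vector below are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper audits much larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                    # which block to steer (assumption)
hidden = model.config.hidden_size
steer_vec = torch.randn(hidden)  # placeholder; US/RepE derive this from model data
steer_vec = steer_vec / steer_vec.norm()
coeff = 4.0                      # steering coefficient: the quantity the grid search tunes

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    h = output[0] + coeff * steer_vec.to(output[0].dtype)
    return (h,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
ids = tok("How do I", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0]))
handle.remove()  # detach the hook once the steered generation is done
```

The coefficient `coeff` is the knob the audit optimizes: too small and the injected concept typically has no behavioral effect, too large and generations tend to degrade into incoherence, which is what motivates a search over coefficient values.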

Abstract

Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches -- Universal Steering (US) and Representation Engineering (RepE) -- we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
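
The abstract's adaptive two-stage grid search is not specified in detail here; the sketch below shows one plausible coarse-then-fine realization. The `jailbreak_rate` function is a hypothetical stand-in for the paper's steer-generate-judge loop, and the grid bounds and sizes are arbitrary assumptions.

```python
# A minimal sketch of a two-stage (coarse-then-fine) adaptive grid search over a
# steering coefficient. `jailbreak_rate` is a hypothetical stand-in for the paper's
# LLM-judge scoring loop; the grids and refinement width are assumptions.
import numpy as np

def jailbreak_rate(coeff: float) -> float:
    """Placeholder: fraction of harmful queries judged jailbroken at this coefficient.
    In the audit this would steer the model, generate responses, and ask an LLM judge."""
    return float(np.exp(-((coeff - 5.0) ** 2) / 8.0))  # toy unimodal response surface

def two_stage_grid_search(lo=0.0, hi=12.0, coarse_n=7, fine_n=9):
    # Stage 1: coarse sweep over the full coefficient range
    coarse = np.linspace(lo, hi, coarse_n)
    best = coarse[int(np.argmax([jailbreak_rate(c) for c in coarse]))]
    # Stage 2: refine around the coarse optimum with a narrower, denser grid
    step = (hi - lo) / (coarse_n - 1)
    fine = np.linspace(max(lo, best - step), min(hi, best + step), fine_n)
    scores = [jailbreak_rate(c) for c in fine]
    return fine[int(np.argmax(scores))], max(scores)

coeff, rate = two_stage_grid_search()
print(f"best coefficient ~ {coeff:.2f}, jailbreak rate ~ {rate:.2f}")
```

A coarse pass keeps the number of expensive generate-and-judge evaluations low across the full coefficient range, while the fine pass recovers precision only where it matters, which is presumably why a two-stage scheme is preferred over a single dense grid.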