SafeSeek: Universal Attribution of Safety Circuits in Language Models

arXiv cs.LG / 3/25/2026


Key Points

  • The paper introduces SafeSeek, a unified framework for mechanistic interpretability that aims to reliably attribute LLM safety-critical behaviors to functionally complete “safety circuits.”
  • Instead of heuristic or domain-specific attribution methods, SafeSeek uses differentiable binary masks optimized via gradient descent to extract multi-granular circuits from safety datasets.
  • It further incorporates Safety Circuit Tuning to reuse the sparse, identified circuits for efficient safety fine-tuning, targeting both interpretability and practical deployment.
  • In experiments on backdoor attacks, the method identifies a highly sparse backdoor circuit (0.42% sparsity) whose ablation collapses the attack success rate from 100% to 0.4% while preserving over 99% of general utility.
  • For safety alignment, SafeSeek localizes an alignment circuit (3.03% of heads, 0.79% of neurons) whose removal sharply increases ASR from 0.8% to 96.9%, while excluding the circuit during helpfulness fine-tuning preserves safety retention at 96.5%.

Abstract

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods that focus on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity, whose ablation reduces the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit comprising 3.03% of heads and 0.79% of neurons, whose removal spikes ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
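The core idea of mask-based circuit extraction can be illustrated with a toy sketch. The example below is illustrative only, not the paper's actual implementation: it uses a made-up set of eight "components" (standing in for heads or neurons), a sigmoid relaxation of the binary mask, and a hand-coded gradient step, where the task term rewards keeping the components that carry the (synthetic) safety signal and a sparsity penalty pushes the rest toward zero. Thresholding the learned mask then yields a sparse "circuit".

```python
import math

# Toy sketch of differentiable-mask circuit extraction (illustrative;
# names, objective, and hyperparameters are assumptions, not the paper's code).
# Components 2 and 5 are the ground-truth "safety circuit" in this toy setup.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

n = 8
signal = [1.0 if i in (2, 5) else 0.0 for i in range(n)]  # true circuit
logits = [0.0] * n          # mask parameters, one per component
lam, lr = 0.05, 0.5         # sparsity weight, learning rate

for _ in range(500):
    m = [sigmoid(z) for z in logits]
    for i in range(n):
        # Loss = -(m . signal) + lam * sum(m); chain rule through sigmoid.
        grad = (-signal[i] + lam) * m[i] * (1.0 - m[i])
        logits[i] -= lr * grad

# Binarize the relaxed mask to obtain the discrete circuit.
circuit = [i for i in range(n) if sigmoid(logits[i]) > 0.5]
print(circuit)  # -> [2, 5]
```

In the real setting the mask would gate actual model components inside a forward pass, the task term would be a loss over a safety dataset, and optimization would run via autodiff rather than a hand-derived gradient, but the trade-off being optimized (behavioral faithfulness versus sparsity) is the same.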
