SafeSeek: Universal Attribution of Safety Circuits in Language Models

arXiv cs.LG / 3/25/2026


Key Points

  • The paper introduces SafeSeek, a unified framework for mechanistic interpretability that aims to reliably attribute LLM safety-critical behaviors to functionally complete “safety circuits.”
  • Instead of heuristic or domain-specific attribution methods, SafeSeek uses differentiable binary masks optimized via gradient descent to extract multi-granular circuits from safety datasets.
  • It further incorporates Safety Circuit Tuning to reuse the sparse, identified circuits for efficient safety fine-tuning, targeting both interpretability and practical deployment.
  • In experiments on backdoor attacks, the method identifies a highly sparse backdoor circuit (0.42% sparsity) whose ablation collapses the attack success rate from 100% to 0.4% while preserving over 99% of general utility.
  • For safety alignment, SafeSeek localizes an alignment circuit (3.03% of heads, 0.79% of neurons) whose removal sharply increases ASR from 0.8% to 96.9%, while excluding the circuit during helpfulness fine-tuning preserves safety retention at 96.5%.

Abstract

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods that focus on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity, whose ablation reduces the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit comprising 3.03% of heads and 0.79% of neurons, whose removal spikes ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
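The core idea of mask-based circuit extraction can be illustrated with a toy sketch. The example below is illustrative only, not the paper's actual implementation: it uses a made-up set of eight "components" (standing in for heads or neurons), a sigmoid relaxation of the binary mask, and a hand-coded gradient step, where the task term rewards keeping the components that carry the (synthetic) safety signal and a sparsity penalty pushes the rest toward zero. Thresholding the learned mask then yields a sparse "circuit".

```python
import math

# Toy sketch of differentiable-mask circuit extraction (illustrative;
# names, objective, and hyperparameters are assumptions, not the paper's code).
# Components 2 and 5 are the ground-truth "safety circuit" in this toy setup.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

n = 8
signal = [1.0 if i in (2, 5) else 0.0 for i in range(n)]  # true circuit
logits = [0.0] * n          # mask parameters, one per component
lam, lr = 0.05, 0.5         # sparsity weight, learning rate

for _ in range(500):
    m = [sigmoid(z) for z in logits]
    for i in range(n):
        # Loss = -(m . signal) + lam * sum(m); chain rule through sigmoid.
        grad = (-signal[i] + lam) * m[i] * (1.0 - m[i])
        logits[i] -= lr * grad

# Binarize the relaxed mask to obtain the discrete circuit.
circuit = [i for i in range(n) if sigmoid(logits[i]) > 0.5]
print(circuit)  # -> [2, 5]
```

In the real setting the mask would gate actual model components inside a forward pass, the task term would be a loss over a safety dataset, and optimization would run via autodiff rather than a hand-derived gradient, but the trade-off being optimized (behavioral faithfulness versus sparsity) is the same.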
