Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

arXiv cs.LG / 5/6/2026


Key Points

  • The paper introduces MechaRule, an explainable-AI pipeline that extracts symbolic rules from LLMs while explicitly grounding those rules in the model’s internal neuron circuitry via “agonists.”
  • MechaRule localizes sparse agonist neuron sets efficiently via contrastive hierarchical ablation, framing the search as adaptive group testing under an approximately monotone, saturating “overtopping” abstraction.
  • The method relies on verifying ablation effects with data splits that closely match faithful rule behavior; it notes that spectral splits can work as a fallback but unfaithful splits harm localization quality.
  • In experiments on arithmetic and jailbreak tasks with Qwen2 and GPT-J, MechaRule recovers 96.8% of the high-effect agonists identified by brute-force search (among completed comparisons), and ablating the localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.
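The adaptive group-testing idea in the second bullet can be sketched in a few lines. This is a toy illustration, not MechaRule's implementation: `has_effect` is a hypothetical oracle standing in for one joint-ablation intervention (does neutralizing every neuron in the group flip rule-related behavior on the probe data?), and the conservative prune on a negative test is only valid under the paper's monotone-overtopping abstraction.

```python
def localize_agonists(candidates, has_effect):
    """Recursively bisect candidate neuron groups, pruning any group whose
    joint ablation shows no effect (a step justified only under an
    approximately monotone effect abstraction)."""
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not has_effect(group):
            continue  # no agonist inside this group (by monotonicity): prune
        if len(group) == 1:
            found.append(group[0])  # isolated a single agonist neuron
        else:
            mid = len(group) // 2
            stack.append(group[:mid])
            stack.append(group[mid:])
    return sorted(found)

# Toy oracle: pretend neurons {3, 17, 42} are agonists, and any group
# containing at least one of them shows an ablation effect.
AGONISTS = {3, 17, 42}
oracle = lambda group: bool(AGONISTS & set(group))
print(localize_agonists(range(64), oracle))  # -> [3, 17, 42]
```

With k agonists among N candidates, the bisection above issues on the order of k·log(N/k) oracle calls instead of N one-at-a-time ablations, which is the intuition behind the paper's Θ(k log(N/k) + k) intervention count.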

Abstract

A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Θ(k log(N/k) + k) interventions over N candidates when k ≪ N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.
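The verification step the abstract describes, checking that neutralizing a candidate agonist set actually changes rule-related behavior on an evaluation split, can be illustrated with a toy flip-rate computation. Everything below is invented for illustration (a hand-built activation matrix and linear readout, not LLM internals): zero-ablating a dominant "overtopping" unit flips many predictions, while ablating a weak unit flips almost none.

```python
import numpy as np

# Toy hidden-activation matrix: 6 examples x 4 units; unit 2 dominates
# the readout, playing the role of a high-effect agonist.
H = np.array([
    [0.1, 0.2, 5.0, 0.1],
    [0.0, 0.1, 4.0, 0.2],
    [0.2, 0.0, 6.0, 0.0],
    [0.1, 0.1, 0.0, 0.3],
    [0.3, 0.2, 5.5, 0.1],
    [0.0, 0.3, 0.0, 0.2],
])
v = np.array([1.0, -1.0, 1.0, -1.0])  # fixed linear readout weights

def predict(h, ablate=()):
    """Binary decision from the readout, with selected units zero-ablated."""
    h = h.copy()
    h[:, list(ablate)] = 0.0  # activation neutralization
    return (h @ v > 0).astype(int)

def flip_rate(h, ablate):
    """Fraction of evaluation examples whose prediction flips under ablation."""
    return float((predict(h) != predict(h, ablate)).mean())

print(flip_rate(H, ablate=[2]))  # dominant unit: 0.5 (half the split flips)
print(flip_rate(H, ablate=[0]))  # weak unit: 0.0 (no behavioral effect)
```

This flip-rate contrast is the kind of signal a strength predicate could threshold when deciding whether a candidate group contains agonists; in the paper this verification is run on rule-aligned (or, as a fallback, spectral) data splits rather than a toy matrix.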