Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

arXiv cs.LG / 5/6/2026


Key Points

  • The paper introduces MechaRule, an explainable-AI pipeline that extracts symbolic rules from LLMs while explicitly grounding those rules in the model’s internal neuron circuitry via “agonists.”
  • MechaRule localizes sparse agonist neuron sets efficiently via contrastive hierarchical ablation, framing the search as adaptive group testing under an approximately monotone, saturating “overtopping” abstraction.
  • The method relies on verifying ablation effects with data splits that closely match faithful rule behavior; it notes that spectral splits can work as a fallback but unfaithful splits harm localization quality.
  • In experiments on arithmetic and jailbreak tasks with Qwen2 and GPT-J, MechaRule recovers 96.8% of the high-effect agonists identified by brute-force search (among completed comparisons), and ablating the localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.
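The adaptive group-testing idea in the second bullet can be sketched in a few lines. This is a toy illustration, not MechaRule's implementation: `has_effect` is a hypothetical oracle standing in for one joint-ablation intervention (does neutralizing every neuron in the group flip rule-related behavior on the probe data?), and the conservative prune on a negative test is only valid under the paper's monotone-overtopping abstraction.

```python
def localize_agonists(candidates, has_effect):
    """Recursively bisect candidate neuron groups, pruning any group whose
    joint ablation shows no effect (a step justified only under an
    approximately monotone effect abstraction)."""
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not has_effect(group):
            continue  # no agonist inside this group (by monotonicity): prune
        if len(group) == 1:
            found.append(group[0])  # isolated a single agonist neuron
        else:
            mid = len(group) // 2
            stack.append(group[:mid])
            stack.append(group[mid:])
    return sorted(found)

# Toy oracle: pretend neurons {3, 17, 42} are agonists, and any group
# containing at least one of them shows an ablation effect.
AGONISTS = {3, 17, 42}
oracle = lambda group: bool(AGONISTS & set(group))
print(localize_agonists(range(64), oracle))  # -> [3, 17, 42]
```

With k agonists among N candidates, the bisection above issues on the order of k·log(N/k) oracle calls instead of N one-at-a-time ablations, which is the intuition behind the paper's Θ(k log(N/k) + k) intervention count.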

Abstract

A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Θ(k log(N/k) + k) interventions over N candidates when k ≪ N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.
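The verification step the abstract describes, checking that neutralizing a candidate agonist set actually changes rule-related behavior on an evaluation split, can be illustrated with a toy flip-rate computation. Everything below is invented for illustration (a hand-built activation matrix and linear readout, not LLM internals): zero-ablating a dominant "overtopping" unit flips many predictions, while ablating a weak unit flips almost none.

```python
import numpy as np

# Toy hidden-activation matrix: 6 examples x 4 units; unit 2 dominates
# the readout, playing the role of a high-effect agonist.
H = np.array([
    [0.1, 0.2, 5.0, 0.1],
    [0.0, 0.1, 4.0, 0.2],
    [0.2, 0.0, 6.0, 0.0],
    [0.1, 0.1, 0.0, 0.3],
    [0.3, 0.2, 5.5, 0.1],
    [0.0, 0.3, 0.0, 0.2],
])
v = np.array([1.0, -1.0, 1.0, -1.0])  # fixed linear readout weights

def predict(h, ablate=()):
    """Binary decision from the readout, with selected units zero-ablated."""
    h = h.copy()
    h[:, list(ablate)] = 0.0  # activation neutralization
    return (h @ v > 0).astype(int)

def flip_rate(h, ablate):
    """Fraction of evaluation examples whose prediction flips under ablation."""
    return float((predict(h) != predict(h, ablate)).mean())

print(flip_rate(H, ablate=[2]))  # dominant unit: 0.5 (half the split flips)
print(flip_rate(H, ablate=[0]))  # weak unit: 0.0 (no behavioral effect)
```

This flip-rate contrast is the kind of signal a strength predicate could threshold when deciding whether a candidate group contains agonists; in the paper this verification is run on rule-aligned (or, as a fallback, spectral) data splits rather than a toy matrix.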