Disentangling MLP Neuron Weights in Vocabulary Space

arXiv cs.CL / 4/8/2026


Key Points

  • The paper proposes ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free mechanistic interpretability method that disentangles MLP neurons directly in weight space without any forward passes.
  • ROTATE uses a statistical insight that neurons corresponding to coherent, monosemantic concepts show high kurtosis when their weights are projected into the model’s vocabulary space.
  • By optimizing rotations of neuron weight vectors to maximize vocabulary-space kurtosis, the method recovers sparse, interpretable directions called “vocabulary channels.”
  • Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it indicate the recovered vocabulary channels match the neurons’ functional behavior, including targeted effects from ablating specific channels.
  • Aggregating channel-level descriptions produces more comprehensive neuron interpretations, outperforming optimized activation-based baselines by roughly 2–3x in head-to-head comparisons.
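The kurtosis criterion in the second bullet can be illustrated with a toy sketch (not the paper's implementation; the unembedding matrix `W_U`, the dimensions, and the `excess_kurtosis` helper are all illustrative): a direction aligned with a specific token's unembedding row yields a heavy-tailed, high-kurtosis distribution of logit projections, while a generic direction looks roughly Gaussian.

```python
import numpy as np

def excess_kurtosis(x):
    # Fourth standardized moment minus 3; large positive values indicate
    # a heavy-tailed projection where a few tokens dominate.
    x = x - x.mean()
    return float((x**4).mean() / (x**2).mean()**2 - 3.0)

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000
# Toy stand-in for a model's unembedding matrix (vocab x d_model).
W_U = rng.standard_normal((vocab, d_model)) / np.sqrt(d_model)

# A "monosemantic" direction: aligned with a single token's unembedding row.
mono_dir = W_U[0] / np.linalg.norm(W_U[0])

# A generic random direction for comparison.
dense_dir = rng.standard_normal(d_model)
dense_dir /= np.linalg.norm(dense_dir)

k_mono = excess_kurtosis(W_U @ mono_dir)    # heavy-tailed: one token stands out
k_dense = excess_kurtosis(W_U @ dense_dir)  # near-Gaussian: excess kurtosis near 0
```

On this toy data, `k_mono` is well above `k_dense`, which is the signal ROTATE exploits: coherent, token-aligned directions concentrate their vocabulary-space mass on a few tokens.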

Abstract

Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions, which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior: ablating individual channels selectively disables the corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.
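The rotation-optimization step described in the abstract can be sketched as gradient ascent on mean vocabulary-space kurtosis over an orthogonal rotation of a block of neuron weights. This is a minimal, self-contained sketch under assumed shapes and a toy unembedding, not the paper's actual objective or training loop; names like `W_U`, `W_neurons`, and `neg_mean_kurtosis` are illustrative.

```python
import torch

torch.manual_seed(0)
d_model, vocab, n_neurons = 32, 500, 8
W_U = torch.randn(vocab, d_model) / d_model**0.5   # toy unembedding
W_neurons = torch.randn(n_neurons, d_model)        # toy MLP neuron weight rows

# Parametrize an orthogonal rotation R acting on the neuron block, so the
# rotated rows stay within the span of the original neurons.
rot = torch.nn.utils.parametrizations.orthogonal(
    torch.nn.Linear(n_neurons, n_neurons, bias=False)
)

def neg_mean_kurtosis(R):
    # Project rotated neuron weights into vocabulary space and penalize
    # low kurtosis (we minimize the negative fourth standardized moment).
    proj = R @ W_neurons @ W_U.T                   # (n_neurons, vocab)
    z = (proj - proj.mean(dim=1, keepdim=True)) / proj.std(dim=1, keepdim=True)
    return -(z**4).mean()

kurt_before = -neg_mean_kurtosis(rot.weight.detach()).item()

opt = torch.optim.Adam(rot.parameters(), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = neg_mean_kurtosis(rot.weight)
    loss.backward()
    opt.step()

kurt_after = -neg_mean_kurtosis(rot.weight.detach()).item()
# Rows of `channels` are candidate "vocabulary channels".
channels = rot.weight.detach() @ W_neurons
```

The orthogonality constraint keeps the recovered directions a rotation (not an arbitrary mixing) of the original neuron block, mirroring the "optimizing rotations of neuron weights" framing in the abstract.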