Disentangling MLP Neuron Weights in Vocabulary Space
arXiv cs.CL / 4/8/2026
Key Points
- The paper proposes ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free mechanistic interpretability method that disentangles MLP neurons directly in weight space without any forward passes.
- ROTATE builds on a statistical insight: neurons encoding coherent, monosemantic concepts show high kurtosis when their weights are projected into the model's vocabulary space (a minimal sketch of this statistic follows the list).
- By optimizing rotations of neuron weight vectors to maximize vocabulary-space kurtosis, the method recovers sparse, interpretable directions called “vocabulary channels” (see the optimization sketch after this list).
- Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it indicate the recovered vocabulary channels match the neurons’ functional behavior, including targeted effects from ablating specific channels.
- Aggregating channel-level descriptions yields more comprehensive neuron interpretations than activation-based baselines, outperforming them by roughly 2–3x in the paper’s comparisons.
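
The kurtosis statistic can be computed directly from weights, with no forward passes. Here is a minimal PyTorch sketch, assuming `w` is one MLP neuron's output weight vector and `W_U` is the model's unembedding matrix; these names and the exact normalization are illustrative, not taken from the paper.

```python
import torch

def vocab_kurtosis(w: torch.Tensor, W_U: torch.Tensor) -> torch.Tensor:
    """Excess kurtosis of a neuron weight vector projected into vocabulary space.

    w:   (d_model,)         one neuron's output weight vector
    W_U: (d_model, n_vocab) unembedding matrix
    """
    logits = w @ W_U                      # (n_vocab,) vocabulary-space projection
    z = (logits - logits.mean()) / logits.std()
    return (z ** 4).mean() - 3.0          # heavy tails => high excess kurtosis
```

Intuitively, a monosemantic neuron concentrates its logit mass on a small set of related tokens, producing a heavy-tailed (high-kurtosis) projection, while a polysemantic mixture of concepts looks closer to Gaussian.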
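The rotation search itself can be sketched as gradient ascent over an orthogonal matrix acting on a block of neuron weights. This is a hedged illustration of the general technique, not the paper's exact objective or parametrization; `fit_rotation`, `steps`, and `lr` are hypothetical names.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

def fit_rotation(W_neurons: torch.Tensor, W_U: torch.Tensor,
                 steps: int = 500, lr: float = 1e-2) -> torch.Tensor:
    """Learn a (k, k) rotation of k stacked neuron weight vectors (k, d_model)
    that maximizes the summed vocabulary-space excess kurtosis."""
    k = W_neurons.shape[0]
    rot = orthogonal(nn.Linear(k, k, bias=False))  # weight constrained orthogonal
    opt = torch.optim.Adam(rot.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mixed = rot.weight @ W_neurons             # (k, d_model) rotated directions
        logits = mixed @ W_U                       # (k, n_vocab)
        z = (logits - logits.mean(-1, keepdim=True)) / logits.std(-1, keepdim=True)
        loss = -((z ** 4).mean(-1) - 3.0).sum()    # ascend on excess kurtosis
        loss.backward()
        opt.step()
    return rot.weight.detach()
```

Each row of `fit_rotation(...) @ W_neurons` would then be a candidate vocabulary channel whose meaning can be read off from its top-logit tokens.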