Self-Routing: Parameter-Free Expert Routing from Hidden States

arXiv cs.AI / 4/2/2026


Key Points

  • The paper introduces “Self-Routing,” a parameter-free Mixture-of-Experts routing method that turns a designated hidden-state subspace directly into expert logits, removing the need for a learned router projection module.
  • Experiments on GPT-2-scale language modeling show Self-Routing performs competitively with a standard learned-router baseline, while eliminating all dedicated routing parameters.
  • Self-Routing improves expert utilization balance, achieving about 17% higher average normalized routing entropy and avoiding an explicit load-balancing loss.
  • On ImageNet-1K with DeiT-S/16, Self-Routing slightly outperforms the corresponding learned-router MoE, indicating the approach can generalize beyond language models.
  • The authors conclude that effective MoE routing can be derived from the model’s hidden representations themselves, challenging the assumption that a dedicated learned router is strictly necessary.

Abstract

Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17% higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
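The core idea can be sketched in a few lines. The minimal example below illustrates the contrast the abstract describes: instead of projecting the hidden state through a learned router matrix, a designated slice of the hidden state is read directly as expert logits, followed by the usual softmax and top-k selection. Using the first `num_experts` dimensions as the routing subspace is an assumption of this sketch, not necessarily the paper's exact subspace choice.

```python
import numpy as np

def self_route(hidden, num_experts, top_k=2):
    """Parameter-free routing sketch: a slice of the hidden state acts as
    expert logits. No learned router projection is involved. The choice of
    the first `num_experts` dims as the routing subspace is hypothetical."""
    logits = hidden[..., :num_experts]                       # no W_router @ hidden
    # softmax over experts (numerically stable)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # standard top-k expert selection, as in a learned-router MoE
    top_idx = np.argsort(probs, axis=-1)[..., ::-1][..., :top_k]
    top_w = np.take_along_axis(probs, top_idx, axis=-1)
    top_w = top_w / top_w.sum(axis=-1, keepdims=True)        # renormalize over top-k
    return top_idx, top_w

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16, 64))                    # (batch, seq, hidden)
idx, w = self_route(tokens, num_experts=8)
print(idx.shape, w.shape)                                    # (4, 16, 2) (4, 16, 2)
```

The rest of the MoE layer (dispatching each token to its selected experts and mixing their outputs by `top_w`) is unchanged; the only difference from a standard MoE is that the router matrix and its parameters disappear.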