Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

arXiv cs.AI / April 21, 2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a parameter-free decomposition for Mixture-of-Experts (MoE) models that separates each layer’s representation into a control signal for routing and an orthogonal content channel that the router cannot see.
  • Experiments across six MoE architectures show that the content channel retains surface-level properties like language, token identity, and position, while the control signal captures an abstract function that evolves across layers.
  • Because routing decisions are low-bandwidth, the mechanism encourages compositional specialization, making expert paths effectively monosemantic even if individual experts remain polysemantic.
  • The study finds that the same token can follow different trajectories depending on its semantic role (e.g., a colon used as a type annotation vs. an introductory colon vs. a time separator), and that clusters are more monosemantic in the control subspace than in the full representation.
  • The authors conclude that, for interpretability in MoEs, the more natural unit is the token trajectory (route over layers) rather than the expert itself.

Abstract

An LLM's residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., ":") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.
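The abstract describes the decomposition only at a high level. One natural reading, sketched below under assumption (the paper's exact construction may differ), is that because a standard MoE router is a linear map over the hidden state, the "control signal" is the orthogonal projection of the hidden state onto the router's row space, and the "content channel" is the orthogonal complement, which a linear router provably cannot see. The function name and shapes here are illustrative, not from the paper.

```python
import numpy as np

def decompose_hidden_state(h, W_router):
    """Hypothetical sketch of a parameter-free control/content split.

    h        : (d,) hidden state at one MoE layer
    W_router : (num_experts, d) linear router weights

    Assumption: "control" = projection of h onto the router's row
    space; "content" = the orthogonal complement, which leaves the
    router's logits unchanged.
    """
    # Orthonormal basis for the router's row space via QR decomposition.
    Q, _ = np.linalg.qr(W_router.T)   # (d, num_experts), orthonormal columns
    control = Q @ (Q.T @ h)           # component the router can see
    content = h - control             # component invisible to the router
    return control, content

# Removing the content channel leaves the routing logits intact:
rng = np.random.default_rng(0)
d, n_experts = 64, 8
h = rng.normal(size=d)
W = rng.normal(size=(n_experts, d))
control, content = decompose_hidden_state(h, W)
assert np.allclose(W @ h, W @ control)  # router sees only the control signal
assert np.allclose(W @ content, 0.0)    # content channel is invisible
```

Under this reading, "parameter-free" means the split requires no learned projection: it is determined entirely by the router weights the model already has, which is consistent with the abstract's claim that the content channel is invisible to the router.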