Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

arXiv cs.AI / 4/17/2026

Key Points

The paper tests whether Mixture-of-Experts (MoE) routing topology (e.g., learned routing, multi-hop routing, token-dependent gating) actually affects language modeling quality. It does so with ST-MoE, a geometric MoE that routes by cosine similarity against learned centroids in a low-dimensional space (d_space=64).
Across 62 controlled experiments on WikiText-103 (76–84M parameters) trained to convergence, five cosine-routing variants show statistically equivalent asymptotic perplexity within a 1-PPL margin, indicating routing topology does not determine final modeling quality.
The finding extends to hash, random-fixed, and top-1 routing (single-seed runs, with a graceful degradation of 1.1–2.2 PPL) and replicates on OpenWebText (0.03 PPL gap).
A standard linear router with 5.3× more routing parameters reaches PPL 32.76, but iso-parameter cosine routing closes 67% of that gap, which leaves a true mechanism advantage of only ~1.2% for the linear router.
The authors attribute this near-indifference to topology to convergent redundancy: multi-hop updates are largely collinear (cosine of 0.805 between the first two updates), so they act as magnitude amplification rather than compositional reasoning. As a practical payoff, zero-shot relative-norm halting saves 25% of MoE FLOPs at a cost of only +0.12% PPL.
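The cosine-similarity routing named in the first key point can be illustrated with a minimal sketch. Only d_space=64 comes from the source. The hidden size, expert count, top-k value, the linear projection into the routing space, and the softmax gating over the selected experts are all assumptions made for illustration, not the paper's implementation.

```python
import math
import random

D_SPACE = 64      # routing-space width stated in the source
D_MODEL = 128     # hypothetical hidden size, chosen only for this demo
N_EXPERTS = 8     # hypothetical expert count
TOP_K = 2         # hypothetical; the source does not state k


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def project(w_proj, h):
    """Map a D_MODEL vector into the routing space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, h)) for row in w_proj]


def route(h, w_proj, centroids, k=TOP_K):
    """Return k (expert_index, gate_weight) pairs, best match first."""
    z = project(w_proj, h)
    sims = [cosine(z, c) for c in centroids]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over the selected similarities only (an assumed gating rule).
    peak = sims[top[0]]
    exps = [math.exp(sims[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Random stand-ins for parameters that would be learned in a real model.
rng = random.Random(0)
w_proj = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_SPACE)]
centroids = [[rng.gauss(0, 1) for _ in range(D_SPACE)] for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]
assignment = route(token, w_proj, centroids)
```

Here `assignment` holds the two best-matching experts for one token together with gate weights that sum to 1. Because cosine similarity ignores vector length, only the direction of the projected token relative to each centroid affects which experts are chosen.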

Abstract

Sparse Mixture-of-Experts (MoE) architectures employ increasingly sophisticated routing mechanisms -- learned routers, multi-hop trajectories, token-dependent gating. We ask: does routing topology actually determine language modeling quality? We build a geometric MoE (ST-MoE) using cosine-similarity routing against learned centroids in a low-dimensional space ($d_{space} = 64$), requiring 80% fewer routing parameters than standard linear routers. Through 62 controlled experiments on WikiText-103 at 76--84M parameters trained to convergence (50K steps, 1.64B tokens), we find that routing topology does not determine asymptotic perplexity (PPL): five cosine-routing variants are statistically equivalent within a 1-PPL margin (Two One-Sided Tests [TOST], $p < 0.05$ for all 10 pairwise comparisons; 15 runs across 3 seeds, observed range 33.93--34.72). The finding extends to hash, random-fixed, and top-1 routing (single-seed; graceful 1.1--2.2 PPL degradation) and replicates on OpenWebText (0.03 PPL gap, 6 runs, 3 seeds each). A standard linear router with $5.3\times$ more routing parameters reaches PPL 32.76, but iso-parameter cosine routing closes 67% of this gap -- the true mechanism advantage is $\sim$1.2%. The mechanistic explanation is convergent redundancy: multi-hop updates are collinear ($\cos(\Delta h_0, \Delta h_1) = 0.805$), implementing magnitude amplification rather than compositional reasoning; a single learnable scalar replicates multi-hop performance. As a practical payoff, zero-shot relative-norm halting saves 25% of MoE FLOPs at +0.12% PPL. Expert-level specialization and causal controllability -- which coexist with topology-level equifinality -- are explored in a companion paper.
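
The collinearity measurement and the relative-norm halting described in the abstract can be sketched as follows. This is one plausible reading, not the authors' code: the halting rule (skip remaining hops once an update's norm falls below a fraction `tau` of the state's norm), the threshold value, and the toy hop functions are all assumptions introduced here.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v) + 1e-12)


def run_hops(h, hop_fns, tau):
    """Apply hop updates in order, skipping the rest once an update is
    small relative to the state. tau is a hypothetical threshold.
    Returns (final_state, list_of_applied_updates)."""
    deltas = []
    for hop in hop_fns:
        delta = hop(h)
        h = [a + b for a, b in zip(h, delta)]
        deltas.append(delta)
        if norm(delta) / (norm(h) + 1e-12) < tau:
            break  # remaining hops are skipped, which is where FLOPs are saved
    return h, deltas


# Toy hops that all push along one direction with shrinking magnitude.
direction = [0.6, 0.8, 0.0, 0.0]
hops = [
    lambda h: [0.5 * x for x in direction],
    lambda h: [0.02 * x for x in direction],
    lambda h: [0.01 * x for x in direction],  # never reached when tau=0.05
]
state, applied = run_hops([1.0, 0.0, 0.0, 0.0], hops, tau=0.05)
collinearity = cosine(applied[0], applied[1])
```

In this toy setup the two applied updates are exactly collinear, so their sum is a scalar multiple of the first update, which is the intuition behind replacing several hops with a single scaled update. The rule needs no retraining because it only inspects norms at inference time, which is consistent with the "zero-shot" label in the abstract.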