Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

arXiv cs.LG / 5/4/2026


Key Points

  • The paper argues that standard sparse Mixture-of-Experts (MoE) affinity routing breaks down at domain transitions because pre-transition tokens are statistically indistinguishable from within-domain tokens, leaving the gate with no early warning.
  • In controlled experiments with 4 experts, standard routing assigns only ~0.006 probability to the correct expert at the transition, while three lightweight gating changes raise the correct-expert probability to ~0.748 (roughly 124x): beta temporal memory, precision-weighted gating (Pi), and anticipatory routing.
  • The authors connect these routing mechanisms to Friston's Free Energy Principle and implement them with leaky integrate-and-fire (LIF) spiking-neuron dynamics that accumulate routing-relevant context across tokens (a minimal sketch follows this list).
  • An ablation over all 2^3 mechanism subsets shows super-additive effects: beta plus anticipation captures ~75% of the oracle gap and exceeds the sum of the individual gains, whereas anticipation alone provides essentially no benefit.
  • On a character-level MoE language model, beta-routing reduces transition-step bits per character (BPC) from ~6.56 to ~4.01, and the combined beta+anticipation gate places 0.86 probability on the correct domain expert before the new domain appears in the input, vs. 0.42 for standard MoE.
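The beta mechanism is the easiest of the three to picture in code. Below is a minimal sketch, assuming a leaky integrate-and-fire update in which each expert's membrane potential decays and then integrates the current token's affinity logit, with the gate computed over the accumulated potential; the function name, decay constant, and softmax readout are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def beta_gate(affinity_logits, decay=0.9):
    """Per-expert LIF membrane potential over routing logits.

    affinity_logits: (T, E) array of token-expert affinity logits.
    decay: leak factor playing the role of the LIF membrane time constant.
    Returns (T, E) gate probabilities computed from the accumulated
    potential instead of the raw per-token affinity.
    """
    T, E = affinity_logits.shape
    membrane = np.zeros(E)
    gates = np.empty((T, E))
    for t in range(T):
        # leak, then integrate this token's evidence for each expert
        membrane = decay * membrane + affinity_logits[t]
        # softmax over the accumulated membrane potential
        z = membrane - membrane.max()
        e = np.exp(z)
        gates[t] = e / e.sum()
    return gates

# Example: gates at the last token reflect the whole recent context.
gates = beta_gate(np.random.default_rng(0).normal(size=(10, 4)))
print(gates[-1])
```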

Abstract

Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting the number of experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing (Ant), a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw from Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta x Ant interaction: anticipation alone gives nothing (+0.000 +/- 0.001); beta alone gives a modest gain (+0.295 +/- 0.013); combined, they close 75% of the oracle gap (+0.741 +/- 0.002, exceeding the sum by +0.446 +/- 0.014). This is structural: a stateless predictor cannot detect approaching transitions because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE LM (5 seeds), beta-routing reduces transition-step BPC from 6.56 +/- 0.01 (Standard) to 4.01 +/- 0.15 (beta-MoE); the beta + Ant gate places 0.86 +/- 0.02 probability on the correct domain expert before that domain appears in the input, vs. 0.42 +/- 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity-is-not-enough
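Precision-weighted gating (Pi) is described only as the per-expert inverse variance of recent prediction error. A minimal sketch under that reading follows; the sliding error window, the additive log-precision combination rule, and all names are assumptions rather than the paper's exact formulation.

```python
import numpy as np

def pi_gate(affinity_logits, recent_errors, eps=1e-6):
    """Scale gate logits by each expert's precision (inverse variance of
    its recent prediction errors), so reliable experts dominate.

    affinity_logits: (E,) logits for the current token.
    recent_errors:   (W, E) sliding window of per-expert prediction errors.
    """
    precision = 1.0 / (recent_errors.var(axis=0) + eps)  # Pi, per expert
    # one plausible combination: add log-precision before the softmax
    z = affinity_logits + np.log(precision)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# Example: expert 0 has noisy errors, expert 1 is consistent.
rng = np.random.default_rng(0)
errors = np.stack([rng.normal(scale=2.0, size=16),
                   rng.normal(scale=0.1, size=16)], axis=1)
print(pi_gate(np.zeros(2), errors))  # mass shifts to the reliable expert
```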
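Anticipatory routing is a next-state predictor conditioned on the beta-accumulated hidden state, which explains why it is inert on its own: without beta, the predictor only sees tokens that look like ordinary within-domain tokens. The sketch below assumes a hypothetical linear predictor over the membrane state whose output is mixed into the gate logits; `W` and `alpha` are illustrative, not the paper's parameters.

```python
import numpy as np

class AnticipatoryRouter:
    """Mix a next-state prediction, computed from the beta-accumulated
    membrane state, into the affinity logits before the softmax."""

    def __init__(self, n_experts, alpha=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # hypothetical linear next-state predictor (would be learned)
        self.W = rng.normal(scale=0.1, size=(n_experts, n_experts))
        self.alpha = alpha

    def gate(self, membrane, affinity_logits):
        predicted = self.W @ membrane  # anticipated expert relevance
        z = affinity_logits + self.alpha * predicted
        z -= z.max()
        e = np.exp(z)
        return e / e.sum()

# Without beta, `membrane` collapses to the current token's logits and
# carries no early warning; with beta, it encodes drift toward a transition.
router = AnticipatoryRouter(n_experts=4)
print(router.gate(np.array([0.2, 1.5, 0.1, 0.0]), np.zeros(4)))
```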