Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
arXiv cs.LG / 5/4/2026
Key Points
- The paper argues that standard sparse Mixture-of-Experts (MoE) affinity routing breaks down at domain transitions because pre-transition tokens are statistically indistinguishable from within-domain tokens, leaving the gate with no early warning.
- In controlled experiments with 4 experts, standard routing assigns only ~0.006 probability to the correct expert at the transition, while three lightweight gating changes (beta temporal memory, precision-weighted gating (Pi), and anticipatory routing) raise the correct-expert probability to ~0.748, roughly a 124x improvement.
- The authors connect these routing mechanisms to Friston’s Free Energy Principle and implement them with leaky integrate-and-fire (LIF) spiking-neuron dynamics that accumulate routing-relevant context across tokens (a minimal illustrative sketch follows this list).
- An ablation over all subsets of the three mechanisms shows super-additive effects: beta plus anticipation closes ~75% of the gap to an oracle router and exceeds the sum of the individual gains, whereas anticipation alone provides essentially no benefit.
- On a character-level MoE language model, beta-routing reduces transition-step bits per character (BPC) from ~6.56 to ~4.01, and the combined beta+anticipation gate raises the probability of the correct domain expert before the new domain appears in the input (0.86 vs. 0.42 for standard MoE).
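To make the third bullet concrete, here is a minimal, hypothetical sketch of the contrast the paper draws: a standard affinity gate that routes each token independently versus a gate whose logits are leaky-integrated across tokens, loosely analogous to LIF membrane dynamics. The class names, the decay parameter `beta`, and the exact update rule are illustrative assumptions, not the authors' implementation, and the precision-weighting and anticipatory terms are omitted.

```python
# Hypothetical sketch (not the paper's code): token-local affinity gating vs. a
# leaky-integrator ("beta temporal memory") gate that carries routing evidence
# across tokens, loosely analogous to LIF membrane dynamics.
import torch
import torch.nn as nn


class AffinityGate(nn.Module):
    """Standard sparse-MoE gate: routing depends only on the current token."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)

    def forward(self, x):                        # x: (batch, seq, d_model)
        return torch.softmax(self.proj(x), dim=-1)


class LeakyMemoryGate(nn.Module):
    """Gate whose logits are leaky-integrated over the sequence, so evidence
    accumulated before a domain transition can bias routing at the boundary."""
    def __init__(self, d_model: int, n_experts: int, beta: float = 0.9):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)
        self.beta = beta                         # decay of the routing "membrane"

    def forward(self, x):
        logits = self.proj(x)                    # (batch, seq, n_experts)
        mem = torch.zeros_like(logits[:, 0])     # per-expert membrane state
        out = []
        for t in range(logits.size(1)):
            # LIF-style update: decay old evidence, integrate the new logit
            mem = self.beta * mem + (1.0 - self.beta) * logits[:, t]
            out.append(torch.softmax(mem, dim=-1))
        return torch.stack(out, dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                   # toy batch: 2 sequences, 16 tokens
    print(AffinityGate(64, 4)(x).shape)          # torch.Size([2, 16, 4])
    print(LeakyMemoryGate(64, 4)(x).shape)       # torch.Size([2, 16, 4])
```

The point of the leaky state is that evidence gathered before a domain boundary can still influence the gate at the boundary, which a purely token-local affinity gate cannot do.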