Path-Constrained Mixture-of-Experts
arXiv cs.LG · March 20, 2026
📰 News · Models & Research
Key Points
- PathMoE shares router parameters across consecutive layers to shrink the combinatorial path space of sparse MoE architectures, addressing the statistical inefficiency of independent per-layer routing (see the sketch after these points).
- The method yields consistent perplexity and downstream-task improvements on 0.9B- and 16B-parameter models, without requiring auxiliary load-balancing losses.
- Analysis shows that tokens following the same expert path cluster by linguistic function; PathMoE produces more concentrated clusters, better cross-layer routing consistency, and greater robustness to routing perturbations.
- The work reframes MoE architectures around the concept of expert paths, offering new insights into design and analysis.
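A minimal sketch of the core idea, assuming a simple top-1 gate and router sharing within fixed groups of consecutive layers; the class and variable names (`MoELayer`, `group`, etc.) and the exact sharing scheme are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A standard two-layer feed-forward expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Top-1 MoE layer that receives its router from outside, so several
    consecutive layers can share one set of routing parameters."""
    def __init__(self, router: nn.Linear, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = router  # shared module, not a per-layer copy
        self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts))

    def forward(self, x):  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)  # (tokens, n_experts)
        weight, idx = gate.max(dim=-1)            # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return x + out  # residual connection

# Hypothetical configuration: 6 MoE layers, one router shared per group of 3
# consecutive layers. With 8 experts, fully independent routing allows 8**6
# distinct paths; sharing the router couples consecutive decisions and
# biases tokens toward consistent expert choices across a group.
d_model, d_ff, n_experts, n_layers, group = 64, 256, 8, 6, 3
routers = [nn.Linear(d_model, n_experts) for _ in range(n_layers // group)]
layers = nn.ModuleList(
    MoELayer(routers[i // group], d_model, d_ff, n_experts) for i in range(n_layers)
)

x = torch.randn(10, d_model)  # 10 tokens
for layer in layers:
    x = layer(x)
print(x.shape)  # torch.Size([10, 64])
```

Because the shared `nn.Linear` computes the routing logits at every layer in a group, consecutive routing decisions are coupled through one parameter set rather than learned independently, which is one plausible way to realize the path-space reduction the paper describes; the authors' actual scheme may differ in detail.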
Related Articles
[D] Matryoshka Representation Learning
Reddit r/MachineLearning
Two new Qwen3.5 “Neo” fine‑tunes focused on fast, efficient reasoning
Reddit r/LocalLLaMA
HKIC, Gobi Partners and HKU team up for fund backing university research start-ups
SCMP Tech
Yann LeCun’s New LeWorldModel (LeWM) Research Targets JEPA Collapse in Pixel-Based Predictive World Modeling
MarkTechPost
Streaming experts
Simon Willison's Blog