Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
arXiv cs.LG / 4/3/2026
Key Points
- The paper argues that mixture-of-experts (MoE) diffusion language models (DLMs) should use expert-choice (EC) routing instead of token-choice (TC) routing, since TC can cause load imbalance and allocates compute inflexibly.
- EC routing is presented as providing deterministic load balancing by design, leading to higher throughput and faster convergence in experiments compared with TC under similar settings.
- The authors introduce timestep-dependent expert capacity for EC routing, reallocating expert resources across denoising steps and finding that giving more capacity to low-mask-ratio steps improves performance when FLOPs are matched.
- They provide a mechanistic rationale that low-mask-ratio contexts show significantly higher learning efficiency, so concentrating compute there yields the greatest marginal gains.
- The work also shows that pretrained TC-based DLMs can be retrofitted to EC by swapping only the router, improving convergence speed and accuracy across multiple downstream tasks, with code released publicly.
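The contrast between token-choice and expert-choice routing can be made concrete with a small sketch. In TC, each token picks its top expert, so popular experts can overflow; in EC, each expert picks a fixed number of tokens, so per-expert load is constant by construction. The sketch below is a hypothetical toy illustration (not the paper's implementation), with made-up dimensions and a plain top-k selection:

```python
import numpy as np

def expert_choice_route(scores: np.ndarray, capacity: int) -> dict[int, np.ndarray]:
    """Toy expert-choice routing: each expert independently selects its
    top-`capacity` tokens by router score, so every expert processes
    exactly `capacity` tokens -- load balance holds by design.

    scores: (num_tokens, num_experts) router logits.
    Returns: mapping expert index -> sorted array of selected token indices.
    """
    num_tokens, num_experts = scores.shape
    assignment = {}
    for e in range(num_experts):
        # Highest-scoring `capacity` tokens for this expert.
        top = np.argsort(scores[:, e])[::-1][:capacity]
        assignment[e] = np.sort(top)
    return assignment

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 4))          # 8 tokens, 4 experts (illustrative sizes)
routes = expert_choice_route(scores, capacity=2)
assert all(len(idx) == 2 for idx in routes.values())  # uniform per-expert load
```

Note that under EC a token may be chosen by several experts or by none, which is exactly the flexible, non-uniform per-token compute allocation the paper exploits.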
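The timestep-dependent capacity idea can likewise be sketched: under a fixed FLOPs budget, shift expert capacity toward low-mask-ratio denoising steps, where the paper reports the highest learning efficiency. The schedule below is a hypothetical illustration (the weighting function and parameter names are assumptions, not the paper's formula):

```python
import numpy as np

def capacity_schedule(mask_ratios: np.ndarray, base_capacity: int,
                      strength: float = 1.0) -> np.ndarray:
    """Hypothetical schedule: give low-mask-ratio steps more expert
    capacity while keeping total capacity (hence total FLOPs) equal to
    a uniform baseline of `base_capacity` per step.
    """
    # Weight each step by how little of it is masked; `strength`
    # controls how aggressively capacity is skewed.
    weights = (1.0 - mask_ratios) ** strength
    weights = weights / weights.sum()
    # Distribute the matched total budget across steps.
    total_budget = base_capacity * len(mask_ratios)
    return np.round(weights * total_budget).astype(int)

# Denoising proceeds from heavily masked to lightly masked steps.
mask_ratios = np.linspace(0.9, 0.1, 5)
caps = capacity_schedule(mask_ratios, base_capacity=64)
assert caps[-1] > caps[0]          # low-mask steps receive more capacity
assert caps.sum() == 64 * 5        # total budget matches the uniform baseline
```

The key design constraint, per the paper's FLOPs-matched comparison, is that reallocation changes only where compute is spent across timesteps, not how much is spent overall.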