On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
arXiv cs.LG / 3/31/2026
Key Points
- The paper examines why MoE-based multimodal continual instruction tuning of large vision-language models still forgets prior knowledge, attributing the core failure to "routing drift": old-task tokens get misrouted to newly added experts.
- It identifies a token-level failure mode, the "token's dilemma": ambiguous or old tokens in new-task data contribute little to learning yet can still trigger forgetting, because their routing assignments become unstable during training.
- To address this, the authors propose LLaVA-DyMoE, a dynamic MoE framework that expands experts incrementally while using drift-aware, token-level assignment guidance and routing-score regularization to preserve the separation between expert groups.
- On continual instruction tuning benchmarks, the method reduces forgetting by roughly 12% and improves mean final accuracy by over 7% relative to baseline approaches.
- An online project page accompanies the work and provides access to the DyMoE resources.
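To make the "routing drift" failure mode concrete, here is a minimal sketch of how old-task tokens can change experts when a MoE router is expanded, and how a drift-aware assignment might pin them back. This is an illustrative toy, not the paper's implementation: the router shapes, the top-1 routing, and the pinning rule are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over expert scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_old, n_new, n_tokens = 8, 4, 2, 16   # toy dimensions (assumed)

W_old = rng.normal(size=(d, n_old))        # linear router before expansion
W_new_cols = rng.normal(size=(d, n_new))   # columns for the newly added experts
W_expanded = np.concatenate([W_old, W_new_cols], axis=1)

tokens = rng.normal(size=(n_tokens, d))    # stand-in for old-task tokens

# Top-1 expert assignment before and after adding new experts.
old_assign = softmax(tokens @ W_old).argmax(axis=1)
new_assign = softmax(tokens @ W_expanded).argmax(axis=1)

# "Routing drift": tokens whose top-1 expert changed after expansion.
drifted = new_assign != old_assign

# Drift-aware guidance (toy version): pin drifted tokens back to the
# expert they used before expansion, leaving the rest untouched.
guided = np.where(drifted, old_assign, new_assign)

print(f"{int(drifted.sum())} of {n_tokens} tokens drifted to a different expert")
```

In this sketch the guidance is a hard override; the actual paper instead uses soft, training-time signals (assignment guidance and routing-score regularization), but the drift-detection idea, comparing assignments against the pre-expansion router, is the same.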