From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue
arXiv cs.CL / 4/15/2026
Key Points
- The paper argues that LLM routing methods optimized for single-turn selection underperform in multi-turn dialogue, because per-turn rewards are delayed and each model choice shapes the subsequent conversation, creating long-horizon effects.
- It introduces DialRouter, which uses Monte Carlo Tree Search (MCTS) to explore the alternative dialogue branches produced by different LLM choices and trains on the high-cumulative-reward trajectories it discovers.
- DialRouter distills this offline search data into a lightweight routing policy, using retrieval-based future-state approximation so that no online search is needed at deployment.
- Experiments on both open-domain and domain-specific multi-turn dialogue tasks show that DialRouter improves task success rate over single-LLM baselines and prior routing approaches.
- With a cost-aware reward, the method also achieves a better performance–cost trade-off, including on candidate sets that mix open-source and closed-source LLMs.
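The routing idea described in the key points can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the offline trajectory data, the state vectors, the `lam` cost weight, and all function names are hypothetical. It shows the two ingredients the summary mentions, a cost-aware reward and retrieval over offline search data to approximate each candidate model's future value without online search.

```python
import math

# Hypothetical offline search data: (state_vector, chosen_model,
# cumulative_task_reward, inference_cost). In the paper this would come
# from MCTS rollouts over dialogue branches.
OFFLINE = [
    ([0.9, 0.1], "small-llm", 0.6, 0.1),
    ([0.8, 0.2], "small-llm", 0.7, 0.1),
    ([0.1, 0.9], "large-llm", 0.9, 1.0),
    ([0.2, 0.8], "large-llm", 0.8, 1.0),
]

def cost_aware_reward(task_reward, cost, lam=0.3):
    """Trade task reward off against inference cost (lam is assumed)."""
    return task_reward - lam * cost

def cosine(a, b):
    """Cosine similarity between two state vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(state, k=2, lam=0.3):
    """Retrieve the k offline states most similar to the current dialogue
    state and pick the model with the best average cost-aware return,
    avoiding any online tree search at deployment time."""
    neighbors = sorted(OFFLINE, key=lambda t: cosine(state, t[0]),
                       reverse=True)[:k]
    scores = {}
    for _, model, reward, cost in neighbors:
        scores.setdefault(model, []).append(cost_aware_reward(reward, cost, lam))
    return max(scores, key=lambda m: sum(scores[m]) / len(scores[m]))

print(route([0.85, 0.15]))  # state resembles the "small-llm" trajectories
```

In a real system the state vectors would be learned dialogue embeddings and the retrieval index would be far larger; the point is only that a cheap nearest-neighbor lookup can stand in for expensive search once good trajectories have been collected offline.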