From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

arXiv cs.CL / 4/15/2026


Key Points

  • The paper argues that LLM routing methods optimized for single-turn selection underperform in multi-turn dialogue because rewards are delayed and dialogue interaction dynamics create long-horizon effects.
  • It introduces DialRouter, which uses MCTS to explore alternative dialogue branches produced by different LLM choices and trains from high-cumulative-reward trajectories.
  • DialRouter distills the search results into a lightweight routing policy trained offline, and uses retrieval-based future state approximation so that no online search is needed at deployment.
  • Experiments on both open-domain and domain-specific multi-turn dialogue tasks show DialRouter improves task success rate versus single-LLM baselines and prior routing approaches.
  • Under a cost-aware reward, the method also achieves a better performance–cost trade-off, holding across candidate sets that span open-source and closed-source LLMs.
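The search step in the second bullet can be illustrated with a toy sketch. The paper uses MCTS to explore the tree of dialogue branches selectively; the stand-in below instead enumerates every branch exhaustively, which is only feasible for tiny candidate sets and horizons. All names here (`simulate_turn`, the model labels, the strength table) are hypothetical placeholders, not the paper's actual setup.

```python
import itertools
from typing import Callable, List, Tuple

def search_routing_trajectories(
    models: List[str],
    horizon: int,
    simulate_branch: Callable[[Tuple[str, ...]], float],
) -> List[Tuple[Tuple[str, ...], float]]:
    """Enumerate every sequence of model choices over `horizon` turns and
    score each branch by its simulated cumulative reward. High-reward
    trajectories then serve as supervision for the routing policy."""
    scored = [
        (branch, simulate_branch(branch))
        for branch in itertools.product(models, repeat=horizon)
    ]
    # Keep high-cumulative-reward trajectories first.
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored

# Hypothetical reward simulator: a stronger model contributes more,
# and later turns are weighted more heavily (delayed-reward effect).
STRENGTH = {"small-llm": 0.3, "large-llm": 0.9}

def toy_simulator(branch: Tuple[str, ...]) -> float:
    return sum(STRENGTH[m] * (t + 1) for t, m in enumerate(branch))

top = search_routing_trajectories(["small-llm", "large-llm"], 2, toy_simulator)
print(top[0])  # → (('large-llm', 'large-llm'), 2.7)
```

In the real method, `simulate_branch` would involve actually rolling out the dialogue with the chosen LLMs and scoring the outcome, and MCTS would prune most of the exponential branch space rather than enumerating it.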

Abstract

Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.
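The "retrieval-based future state approximation" mentioned above can be pictured as a nearest-neighbor lookup: at deployment, the current dialogue state is embedded and matched against states visited during offline search, and the best action found for the closest stored state is reused. The sketch below assumes a fixed embedding and a tiny hand-made memory purely for illustration; it is not the paper's implementation.

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical offline memory built from search:
# (dialogue-state embedding, best model found by MCTS for that state).
MEMORY: List[Tuple[List[float], str]] = [
    ([0.9, 0.1], "small-llm"),   # e.g. an easy chit-chat state
    ([0.1, 0.9], "large-llm"),   # e.g. a reasoning-heavy state
]

def route(state_embedding: List[float]) -> str:
    """Approximate the future by retrieving the most similar searched
    state and reusing its best action, instead of searching online."""
    return max(MEMORY, key=lambda m: cosine(m[0], state_embedding))[1]

print(route([0.2, 0.8]))  # → "large-llm"
```

This keeps deployment cheap: routing reduces to one embedding plus a similarity search, with no rollouts or tree expansion at inference time.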