Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

arXiv cs.LG · March 31, 2026


Key Points

  • The paper studies query routing for large language models under simultaneous cost, GPU resource, and concurrency constraints, highlighting how prior per-query routing approaches fail to control cost at the batch level.
  • It proposes a batch-level, resource-aware routing framework that jointly optimizes which model to use for each batch under model capacity limits, rather than making independent per-query decisions.
  • A robust variant is introduced to handle uncertainty in predicted LLM performance, improving reliability when estimates are imperfect.
  • Offline instance allocation is also presented to balance quality and throughput across multiple models, further improving end-to-end outcomes.
  • Experiments on two multi-task LLM benchmarks show robustness improves accuracy by 1–14%, batch-level routing beats per-query methods by up to 24% under adversarial batching, and optimized allocation adds up to ~3% while strictly meeting cost and GPU constraints.
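To make the batch-level idea concrete, here is a minimal sketch of routing a batch as a constrained assignment problem. All model names, costs, caps, and predicted scores below are illustrative assumptions, not figures from the paper, and brute-force search stands in for whatever solver the authors actually use.

```python
# Hypothetical sketch: assign each query in a batch to one model so that
# total predicted accuracy is maximized, subject to a batch cost budget
# and a per-model concurrency cap. Numbers are made up for illustration.
from itertools import product

MODELS = {
    # model: (cost per query, concurrency cap within one batch)
    "small": (1.0, 4),
    "large": (3.0, 2),
}

# Assumed predicted accuracy of each model on each query in the batch.
pred = [
    {"small": 0.60, "large": 0.90},
    {"small": 0.80, "large": 0.85},
    {"small": 0.50, "large": 0.95},
    {"small": 0.70, "large": 0.75},
]

def route_batch(pred, budget):
    """Exhaustively search assignments; return the one with the highest
    total predicted accuracy that satisfies the budget and caps."""
    best, best_score = None, float("-inf")
    for assign in product(MODELS, repeat=len(pred)):
        cost = sum(MODELS[m][0] for m in assign)
        if cost > budget:
            continue  # batch-level cost constraint violated
        if any(assign.count(m) > cap for m, (_, cap) in MODELS.items()):
            continue  # per-model concurrency cap violated
        score = sum(pred[i][m] for i, m in enumerate(assign))
        if score > best_score:
            best, best_score = assign, score
    return best, best_score
```

A per-query router that greedily picks the best model for each query in isolation would send everything to `large` and overshoot the budget; the joint search above spends the budget where the predicted quality gap is largest, which is the failure mode the paper's batch-level formulation is designed to avoid.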

Abstract

We study the problem of routing queries to large language models (LLMs) under cost, GPU resource, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1–14% over non-robust counterparts (depending on the performance estimator), batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and optimized instance allocation yields additional gains of up to 3% compared to a non-optimized allocation, all while strictly satisfying cost and GPU resource constraints.
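One way to read the robust variant is as routing on a pessimistic estimate of model quality rather than the point prediction. The sketch below uses a simple lower-confidence-bound penalty; the `kappa` weighting and the mean/std interface are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of robustness to imperfect performance estimates:
# score each model by a lower confidence bound (mean - kappa * std)
# so that models whose predictions are uncertain are penalized.
# The functional form here is an illustrative assumption.
def robust_score(mean, std, kappa=1.0):
    """Pessimistic quality estimate: penalize uncertain predictions."""
    return mean - kappa * std

def pick_model(preds, kappa=1.0):
    """preds maps model name -> (predicted mean accuracy, uncertainty).
    Choose the model with the highest lower confidence bound."""
    return max(preds, key=lambda m: robust_score(*preds[m], kappa))
```

With `kappa = 0` this reduces to non-robust routing on the point estimate; increasing `kappa` trades expected quality for reliability when the estimator is noisy, which is the regime where the paper reports its 1–14% accuracy gains.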