Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
arXiv cs.LG · March 31, 2026
Key Points
- The paper studies query routing for large language models under simultaneous cost, GPU-resource, and concurrency constraints, and examines where prior per-query routing approaches break down at the batch level.
- It proposes a batch-level, resource-aware routing framework that jointly optimizes the model assignment for every query in a batch under per-model capacity limits, rather than making independent per-query decisions (a toy sketch of this formulation follows the list).
- A robust variant is introduced to handle uncertainty in predicted LLM performance, improving reliability when estimates are imperfect.
- An offline instance-allocation procedure is also presented that balances quality and throughput when provisioning serving instances across multiple models, further improving end-to-end outcomes.
- Experiments on two multi-task LLM benchmarks show robustness improves accuracy by 1–14%, batch-level routing beats per-query methods by up to 24% under adversarial batching, and optimized allocation adds up to ~3% while strictly meeting cost and GPU constraints.
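To make the batch-level formulation concrete, here is a minimal sketch of routing one batch as a small integer program, written with PuLP. Everything here is an illustrative assumption rather than the paper's actual code or notation: the inputs (`predicted_acc`, `acc_std`, `cost`, `capacity`, `budget`) are toy values, and robustness is modeled as a simple uncertainty margin `kappa * acc_std` subtracted from predicted accuracy.

```python
# Hypothetical illustration -- not the paper's code. Routes one batch by
# assigning each query to exactly one model, maximizing total (robustly
# discounted) predicted accuracy subject to a batch cost budget and
# per-model concurrency caps.
import pulp

# Toy inputs (all illustrative): 4 queries, 2 models.
predicted_acc = [[0.90, 0.70],   # predicted_acc[q][m]: chance model m answers query q well
                 [0.85, 0.80],
                 [0.60, 0.55],
                 [0.95, 0.65]]
acc_std = [[0.05, 0.02],         # assumed uncertainty of each prediction
           [0.10, 0.03],
           [0.05, 0.05],
           [0.16, 0.02]]
cost = [1.0, 0.2]                # per-query cost of each model
capacity = [2, 4]                # per-model concurrency cap within the batch
budget = 2.0                     # total cost budget for the batch
kappa = 1.0                      # robustness knob; 0 recovers the nominal objective

Q, M = len(predicted_acc), len(cost)
# Robust score: discount uncertain predictions by a margin, one simple
# way to hedge against imperfect performance estimates.
score = [[predicted_acc[q][m] - kappa * acc_std[q][m] for m in range(M)]
         for q in range(Q)]

prob = pulp.LpProblem("batch_routing", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", [(q, m) for q in range(Q) for m in range(M)],
                          cat="Binary")

prob += pulp.lpSum(score[q][m] * x[q, m] for q in range(Q) for m in range(M))
for q in range(Q):   # every query goes to exactly one model
    prob += pulp.lpSum(x[q, m] for m in range(M)) == 1
for m in range(M):   # respect each model's concurrency cap
    prob += pulp.lpSum(x[q, m] for q in range(Q)) <= capacity[m]
# shared cost budget couples all the per-query decisions
prob += pulp.lpSum(cost[m] * x[q, m] for q in range(Q) for m in range(M)) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=0))
routing = {q: next(m for m in range(M) if pulp.value(x[q, m]) > 0.5)
           for q in range(Q)}
print(routing)  # e.g. {0: 0, 1: 1, 2: 1, 3: 1}: only the best upgrade fits the budget
```

Setting `kappa = 0` recovers nominal (non-robust) routing, and dropping the shared budget and capacity constraints collapses the program into independent per-query argmax decisions, which is the failure mode the paper attributes to prior routers. The offline instance-allocation step would sit upstream of this: given a fixed GPU pool, it decides how many serving instances each model gets, which in turn fixes the `capacity` values used here.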