Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

arXiv cs.LG · March 31, 2026


Key Points

  • The paper studies query routing for large language models under simultaneous cost, GPU resource, and concurrency constraints, highlighting how prior per-query routing approaches fail to control cost at the batch level.
  • It proposes a batch-level, resource-aware routing framework that jointly optimizes which model to use for each batch under model capacity limits, rather than making independent per-query decisions.
  • A robust variant is introduced to handle uncertainty in predicted LLM performance, improving reliability when estimates are imperfect.
  • Offline instance allocation is also presented to balance quality and throughput across multiple models, further improving end-to-end outcomes.
  • Experiments on two multi-task LLM benchmarks show robustness improves accuracy by 1–14%, batch-level routing beats per-query methods by up to 24% under adversarial batching, and optimized allocation adds up to ~3% while strictly meeting cost and GPU constraints.
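To make the batch-level idea concrete, here is a minimal sketch of routing a batch as a constrained assignment problem. All model names, costs, caps, and predicted scores below are illustrative assumptions, not figures from the paper, and brute-force search stands in for whatever solver the authors actually use.

```python
# Hypothetical sketch: assign each query in a batch to one model so that
# total predicted accuracy is maximized, subject to a batch cost budget
# and a per-model concurrency cap. Numbers are made up for illustration.
from itertools import product

MODELS = {
    # model: (cost per query, concurrency cap within one batch)
    "small": (1.0, 4),
    "large": (3.0, 2),
}

# Assumed predicted accuracy of each model on each query in the batch.
pred = [
    {"small": 0.60, "large": 0.90},
    {"small": 0.80, "large": 0.85},
    {"small": 0.50, "large": 0.95},
    {"small": 0.70, "large": 0.75},
]

def route_batch(pred, budget):
    """Exhaustively search assignments; return the one with the highest
    total predicted accuracy that satisfies the budget and caps."""
    best, best_score = None, float("-inf")
    for assign in product(MODELS, repeat=len(pred)):
        cost = sum(MODELS[m][0] for m in assign)
        if cost > budget:
            continue  # batch-level cost constraint violated
        if any(assign.count(m) > cap for m, (_, cap) in MODELS.items()):
            continue  # per-model concurrency cap violated
        score = sum(pred[i][m] for i, m in enumerate(assign))
        if score > best_score:
            best, best_score = assign, score
    return best, best_score
```

A per-query router that greedily picks the best model for each query in isolation would send everything to `large` and overshoot the budget; the joint search above spends the budget where the predicted quality gap is largest, which is the failure mode the paper's batch-level formulation is designed to avoid.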

Abstract

We study the problem of routing queries to large language models (LLMs) under cost, GPU resource, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1–14% over non-robust counterparts (depending on the performance estimator), batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and optimized instance allocation yields additional gains of up to 3% compared to a non-optimized allocation, all while strictly satisfying cost and GPU resource constraints.
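One way to read the robust variant is as routing on a pessimistic estimate of model quality rather than the point prediction. The sketch below uses a simple lower-confidence-bound penalty; the `kappa` weighting and the mean/std interface are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of robustness to imperfect performance estimates:
# score each model by a lower confidence bound (mean - kappa * std)
# so that models whose predictions are uncertain are penalized.
# The functional form here is an illustrative assumption.
def robust_score(mean, std, kappa=1.0):
    """Pessimistic quality estimate: penalize uncertain predictions."""
    return mean - kappa * std

def pick_model(preds, kappa=1.0):
    """preds maps model name -> (predicted mean accuracy, uncertainty).
    Choose the model with the highest lower confidence bound."""
    return max(preds, key=lambda m: robust_score(*preds[m], kappa))
```

With `kappa = 0` this reduces to non-robust routing on the point estimate; increasing `kappa` trades expected quality for reliability when the estimator is noisy, which is the regime where the paper reports its 1–14% accuracy gains.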