Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

arXiv cs.CL / 4/10/2026


Key Points

  • The paper identifies a key inefficiency in production vLLM fleets: instances are provisioned for worst-case long contexts, causing KV-cache overallocation for the majority of short requests and reducing effective throughput by 4–8×.
  • It proposes “dual-pool token-budget routing,” which splits a homogeneous LLM serving fleet into a short-context high-throughput pool and a long-context high-capacity pool, routing requests using an estimated total token budget.
  • The routing estimate uses a learned per-category bytes-to-token ratio updated online via exponential moving average from prompt token feedback, avoiding the need for a tokenizer.
  • Experiments on Azure LLM Inference traces and LMSYS-Chat-1M (serving Llama-3-70B on A100 GPUs) show 31–42% GPU-hour reductions (about $2.86M annual savings at scale), with preemption rates dropping by 5.4× and P99 TTFT improving by 6%.
  • The approach adds only O(1) dispatch overhead, adapts to heterogeneous workloads, and works alongside common vLLM optimizations like PagedAttention, continuous batching, and prefill–decode disaggregation.
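The dispatch mechanism described above can be sketched in a few lines. The following is an illustrative reconstruction, not the paper's code: class and parameter names (`TokenBudgetRouter`, the 2048-token threshold, the 0.25 default ratio, the EMA smoothing factor) are assumptions chosen for the example. The key ideas from the paper are preserved: an O(1) byte-count-based budget estimate, a per-category tokens-per-byte ratio updated by exponential moving average from `usage.prompt_tokens` feedback, and a two-way pool decision.

```python
class TokenBudgetRouter:
    """Hypothetical sketch of dual-pool token-budget routing.

    Thresholds and default ratios are illustrative, not from the paper.
    """

    def __init__(self, short_pool, long_pool, threshold_tokens=2048,
                 ema_alpha=0.1, default_ratio=0.25):
        self.short_pool = short_pool        # high-throughput short-context pool
        self.long_pool = long_pool          # high-capacity long-context pool
        self.threshold = threshold_tokens   # routing cutoff on estimated budget
        self.alpha = ema_alpha              # EMA smoothing factor
        self.default_ratio = default_ratio  # tokens-per-byte prior before feedback
        self.ratio = {}                     # learned per-category tokens-per-byte

    def estimate_budget(self, category, prompt_bytes, max_new_tokens):
        # O(1): byte count times learned ratio, plus the decode budget.
        # No tokenizer is invoked on the dispatch path.
        r = self.ratio.get(category, self.default_ratio)
        return prompt_bytes * r + max_new_tokens

    def route(self, category, prompt_bytes, max_new_tokens):
        budget = self.estimate_budget(category, prompt_bytes, max_new_tokens)
        return self.short_pool if budget <= self.threshold else self.long_pool

    def feedback(self, category, prompt_bytes, prompt_tokens):
        # After completion, the server's usage.prompt_tokens gives the true
        # token count; fold the observed tokens-per-byte ratio into the EMA.
        observed = prompt_tokens / max(prompt_bytes, 1)
        prev = self.ratio.get(category, self.default_ratio)
        self.ratio[category] = (1 - self.alpha) * prev + self.alpha * observed
```

Because the ratio is tracked per category and updated continuously, the router adapts to heterogeneous workloads (e.g. code vs. chat prompts, which tokenize at different densities) without any offline calibration.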

Abstract

Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80–95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4–8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31–42%, corresponding to $2.86M annual savings at fleet scale, while lowering preemption rates by 5.4× and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects $15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
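The abstract's analytical model for pre-deployment savings estimates can be sketched as follows. This is my reconstruction under a simplifying assumption, not the paper's exact formula: short requests routed to the short-context pool are assumed to consume `1/speedup` of their baseline GPU-hours, while long requests are unchanged, so the relative cost is `p_short / speedup + (1 - p_short)`.

```python
def projected_savings(p_short, speedup, gpu_hour_cost, baseline_gpu_hours):
    """Hedged sketch of a fleet-level cost-savings estimate.

    p_short            -- fraction of traffic that is short (e.g. 0.80-0.95)
    speedup            -- measured throughput gain of the short-context pool
                          over the worst-case-provisioned baseline
    gpu_hour_cost      -- dollar cost per GPU-hour
    baseline_gpu_hours -- GPU-hours consumed by the undifferentiated fleet
    """
    # Relative GPU-hours after splitting: short traffic shrinks by the
    # speedup factor; long traffic keeps its baseline cost.
    relative = p_short / speedup + (1.0 - p_short)
    saved_fraction = 1.0 - relative
    return saved_fraction, saved_fraction * baseline_gpu_hours * gpu_hour_cost
```

With assumed inputs of `p_short=0.85` and a 2× short-pool speedup, this toy model yields a 42.5% GPU-hour reduction, roughly in line with the upper end of the 31–42% range the paper reports; the actual inputs and model details are in the paper.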