ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
arXiv cs.LG / 4/2/2026
Key Points
- ParetoBandit is an open-source adaptive routing layer for non-stationary LLM serving that enforces dollar-denominated per-request cost budgets while optimizing quality.
- It uses an online primal-dual budget pacer and geometric forgetting to replace offline tuning with closed-loop control that can adapt to pricing/quality shifts over continuous traffic.
- The system supports runtime model hot-swapping via a registry, onboarding new models through a brief forced-exploration phase and then learning their quality-cost niche using live data.
- In experiments with 1,824 prompts across four scenarios and a three-model portfolio, ParetoBandit kept mean per-request cost within 0.4% of its targets and adapted to large price/quality shifts without downtime.
- Routing overhead is low (9.8 ms end-to-end on CPU; ~22.5 µs for the routing decision), making it suitable for production-style inference pipelines.
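The mechanisms the bullets describe — a primal-dual budget pacer, geometric forgetting of quality/cost estimates, and forced exploration for newly registered models — can be sketched as a small routing loop. Everything below (class name, parameter values, and the exact update rules) is an illustrative assumption, not ParetoBandit's published implementation:

```python
import random

class BudgetPacedRouter:
    """Sketch of budget-paced bandit routing with geometric forgetting
    and forced exploration (illustrative, not the paper's exact method)."""

    def __init__(self, cost_guesses, budget_per_req, eta=50.0, gamma=0.05,
                 explore_pulls=20):
        self.budget = budget_per_req   # dollar budget per request
        self.eta = eta                 # dual step size (budget-pacer gain)
        self.gamma = gamma             # geometric forgetting rate
        self.explore_pulls = explore_pulls
        self.lmbda = 0.0               # dual variable: shadow price of spend
        self.q = {m: 0.0 for m in cost_guesses}  # quality estimates
        self.c = dict(cost_guesses)              # cost estimates ($/request)
        self.pulls = {m: 0 for m in cost_guesses}

    def add_model(self, name, cost_guess):
        """Hot-swap registry entry: a new model starts in forced exploration."""
        self.q[name], self.c[name], self.pulls[name] = 0.0, cost_guess, 0

    def route(self):
        # Forced exploration: under-sampled models are routed to first.
        cold = [m for m, n in self.pulls.items() if n < self.explore_pulls]
        if cold:
            return random.choice(cold)
        # Primal step: maximize quality minus budget-priced cost (Lagrangian).
        return max(self.q, key=lambda m: self.q[m] - self.lmbda * self.c[m])

    def update(self, model, quality, cost):
        g = self.gamma
        self.pulls[model] += 1
        # Geometric forgetting lets estimates track pricing/quality shifts.
        self.q[model] = (1 - g) * self.q[model] + g * quality
        self.c[model] = (1 - g) * self.c[model] + g * cost
        # Dual step: raise the shadow price when spend overshoots the budget.
        self.lmbda = max(0.0, self.lmbda + self.eta * (cost - self.budget))

# Toy two-model portfolio: (quality, $/request) are made-up numbers.
random.seed(0)
specs = {"cheap": (0.6, 0.001), "premium": (0.9, 0.010)}
router = BudgetPacedRouter({m: c for m, (_, c) in specs.items()},
                           budget_per_req=0.004)
spend = 0.0
for _ in range(2000):
    m = router.route()
    q, c = specs[m]
    router.update(m, q, c)
    spend += c
```

In this toy run the dual variable rises until premium calls are "priced out" often enough that mean spend settles near the $0.004 budget, which is the closed-loop behavior that replaces offline per-model tuning.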