ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

arXiv cs.LG / 4/2/2026


Key Points

  • ParetoBandit is an open-source adaptive routing layer for non-stationary LLM serving that enforces dollar-denominated per-request cost budgets while optimizing quality.
  • It uses an online primal-dual budget pacer and geometric forgetting to replace offline tuning with closed-loop control that can adapt to pricing/quality shifts over continuous traffic.
  • The system supports runtime model hot-swapping via a registry, onboarding new models through a brief forced-exploration phase and then learning their quality-cost niche using live data.
  • In experiments with 1,824 prompts across four deployment scenarios and a three-model portfolio, mean per-request cost never exceeded the target by more than 0.4% across seven budget ceilings, and the router adapted after large price and quality changes without downtime.
  • Routing overhead is low (9.8 ms end-to-end on CPU, under 0.4% of typical inference time; 22.5 µs for the routing decision itself), making it suitable for production-style inference pipelines.
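
To make the budget-pacing idea concrete, here is a minimal sketch of an online primal-dual pacer. It is not ParetoBandit's code: the class `BudgetPacer`, the helpers `route` and `simulate`, the step size, and the budget-normalised penalty are all assumptions chosen for illustration. A dual variable acts as a price on cost; it rises when realized cost overshoots the budget and falls when spending is under target, so no offline penalty tuning is needed.

```python
from dataclasses import dataclass


@dataclass
class BudgetPacer:
    """Hypothetical online primal-dual pacer (illustrative only)."""
    budget: float            # target mean cost per request, in dollars
    step_size: float = 0.05  # assumed dual learning rate
    lam: float = 0.0         # dual variable: current "price" of cost, kept >= 0

    def score(self, quality: float, cost: float) -> float:
        # Primal step: Lagrangian score, quality minus the priced,
        # budget-normalised cost.
        return quality - self.lam * cost / self.budget

    def update(self, realized_cost: float) -> None:
        # Dual step: projected subgradient ascent on the budget constraint.
        overshoot = (realized_cost - self.budget) / self.budget
        self.lam = max(0.0, self.lam + self.step_size * overshoot)


def route(pacer: BudgetPacer, candidates: dict) -> str:
    # candidates maps a model name to (estimated_quality, cost_in_dollars).
    return max(candidates, key=lambda name: pacer.score(*candidates[name]))


def simulate(pacer: BudgetPacer, candidates: dict, steps: int) -> float:
    # Route a stream of identical requests and return the mean realized cost.
    total = 0.0
    for _ in range(steps):
        choice = route(pacer, candidates)
        cost = candidates[choice][1]
        pacer.update(cost)
        total += cost
    return total / steps
```

With two made-up models, for example `{"cheap": (0.6, 0.001), "premium": (0.9, 0.010)}` and a budget of $0.004, the pacer alternates between them so that the running mean cost settles close to the budget. Because the dual variable stays bounded, the accumulated overshoot is bounded too, which is what keeps the long-run mean cost near the target in this toy setting.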

Abstract

Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8 ms on CPU, less than 0.4% of typical inference time, with the routing decision itself taking just 22.5 µs.
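
The other two mechanisms, geometric forgetting and the hot-swap registry, can be illustrated with a deliberately simplified sketch. The paper uses contextual bandits; the version below is context-free (a discounted UCB), and every name in it (`ForgettingUCB`, `add_model`, `remove_model`, `run_stream`) as well as the constants `gamma`, `bonus`, and `forced_pulls` are assumptions for illustration, not the project's API. Each model's sufficient statistics (a discounted pull count and a discounted reward sum) are multiplied by `gamma` at every step, so stale evidence fades and a silent quality change eventually shows up in the estimates.

```python
import math


class ForgettingUCB:
    """Context-free sketch: discounted UCB over a mutable set of models."""

    def __init__(self, gamma=0.97, bonus=0.5, forced_pulls=5):
        self.gamma = gamma                # geometric forgetting factor
        self.bonus = bonus                # width of the exploration bonus
        self.forced_pulls = forced_pulls  # forced-exploration budget per newcomer
        self.count = {}   # discounted pull count per model
        self.total = {}   # discounted reward sum per model
        self.forced = {}  # forced pulls still owed to each model

    def add_model(self, name, prior_mean=0.0, prior_weight=0.0):
        # An offline prior enters as pseudo-observations that decay like real ones.
        self.count[name] = prior_weight
        self.total[name] = prior_mean * prior_weight
        self.forced[name] = self.forced_pulls

    def remove_model(self, name):
        for table in (self.count, self.total, self.forced):
            table.pop(name, None)

    def select(self):
        # Newcomers are served first until their forced pulls are used up.
        for name, left in self.forced.items():
            if left > 0:
                return name
        horizon = sum(self.count.values())  # total discounted evidence

        def ucb(name):
            n = self.count[name]
            if n <= 0.0:
                return math.inf
            mean = self.total[name] / n
            return mean + self.bonus * math.sqrt(math.log(1.0 + horizon) / n)

        return max(self.count, key=ucb)

    def update(self, name, reward):
        # Forget geometrically on every model, then add the new observation.
        for model in self.count:
            self.count[model] *= self.gamma
            self.total[model] *= self.gamma
        self.count[name] += 1.0
        self.total[name] += reward
        if self.forced[name] > 0:
            self.forced[name] -= 1


def run_stream(router, rewards, steps):
    # rewards maps a model name to a fixed reward (a stand-in for live feedback).
    picks = []
    for _ in range(steps):
        name = router.select()
        router.update(name, rewards[name])
        picks.append(name)
    return picks
```

A possible usage pattern: register two models, run traffic, then lower one model's reward to mimic a silent regression and observe that selections drift to the other model. Calling `add_model` mid-stream triggers the forced-exploration phase for the newcomer, after which the UCB rule decides how much traffic it keeps. Cost is ignored here; in a fuller sketch the reward would be a cost-penalised score.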