ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
arXiv cs.LG / 4/2/2026
Key Points
- ParetoBandit is an open-source adaptive routing layer for non-stationary LLM serving that enforces dollar-denominated per-request cost budgets while optimizing quality.
- It uses an online primal-dual budget pacer and geometric forgetting to replace offline tuning with closed-loop control that can adapt to pricing/quality shifts over continuous traffic.
- The system supports runtime model hot-swapping via a registry, onboarding new models through a brief forced-exploration phase and then learning their quality-cost niche using live data.
- In experiments with 1,824 prompts across four scenarios and a three-model portfolio, ParetoBandit kept mean per-request cost within 0.4% of its targets and adapted to large price/quality shifts without downtime.
- Routing overhead is low (9.8 ms end-to-end on CPU; ~22.5 µs for the routing decision), making it suitable for production-style inference pipelines.
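The mechanisms the bullets describe — a primal-dual budget pacer, geometric forgetting of quality/cost estimates, and forced exploration for newly registered models — can be sketched as a small routing loop. Everything below (class name, parameter values, and the exact update rules) is an illustrative assumption, not ParetoBandit's published implementation:

```python
import random

class BudgetPacedRouter:
    """Sketch of budget-paced bandit routing with geometric forgetting
    and forced exploration (illustrative, not the paper's exact method)."""

    def __init__(self, cost_guesses, budget_per_req, eta=50.0, gamma=0.05,
                 explore_pulls=20):
        self.budget = budget_per_req   # dollar budget per request
        self.eta = eta                 # dual step size (budget-pacer gain)
        self.gamma = gamma             # geometric forgetting rate
        self.explore_pulls = explore_pulls
        self.lmbda = 0.0               # dual variable: shadow price of spend
        self.q = {m: 0.0 for m in cost_guesses}  # quality estimates
        self.c = dict(cost_guesses)              # cost estimates ($/request)
        self.pulls = {m: 0 for m in cost_guesses}

    def add_model(self, name, cost_guess):
        """Hot-swap registry entry: a new model starts in forced exploration."""
        self.q[name], self.c[name], self.pulls[name] = 0.0, cost_guess, 0

    def route(self):
        # Forced exploration: under-sampled models are routed to first.
        cold = [m for m, n in self.pulls.items() if n < self.explore_pulls]
        if cold:
            return random.choice(cold)
        # Primal step: maximize quality minus budget-priced cost (Lagrangian).
        return max(self.q, key=lambda m: self.q[m] - self.lmbda * self.c[m])

    def update(self, model, quality, cost):
        g = self.gamma
        self.pulls[model] += 1
        # Geometric forgetting lets estimates track pricing/quality shifts.
        self.q[model] = (1 - g) * self.q[model] + g * quality
        self.c[model] = (1 - g) * self.c[model] + g * cost
        # Dual step: raise the shadow price when spend overshoots the budget.
        self.lmbda = max(0.0, self.lmbda + self.eta * (cost - self.budget))

# Toy two-model portfolio: (quality, $/request) are made-up numbers.
random.seed(0)
specs = {"cheap": (0.6, 0.001), "premium": (0.9, 0.010)}
router = BudgetPacedRouter({m: c for m, (_, c) in specs.items()},
                           budget_per_req=0.004)
spend = 0.0
for _ in range(2000):
    m = router.route()
    q, c = specs[m]
    router.update(m, q, c)
    spend += c
```

In this toy run the dual variable rises until premium calls are "priced out" often enough that mean spend settles near the $0.004 budget, which is the closed-loop behavior that replaces offline per-model tuning.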