Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

arXiv cs.LG / April 14, 2026


Key Points

  • The paper addresses an LLM inference stability problem: decode lengths are unknown in advance, so per-request memory (e.g., the KV cache) grows with every generated token and can overflow, destabilizing the serving system.
  • It proposes a flow-controlled scheduling framework that limits the rate at which new prompts enter the “active set,” treating stability as a queueing/flow-control problem.
  • The authors derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which their algorithm provably achieves stability.
  • Experiments indicate improved token/request throughput, reduced average and tail latency, and more stable KV cache utilization compared with several widely used practical scheduling strategies.

Abstract

Large language models (LLMs) have been widely adopted due to their strong performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.
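To make the flow-control idea concrete, here is a minimal sketch of admission control into an "active set." This is an illustrative toy, not the paper's algorithm: the class name, the fixed `expected_decode_len` budget, and the utilization threshold are all assumptions introduced for this example; the paper's actual conditions and control rule are more refined.

```python
import collections

class FlowControlledScheduler:
    """Toy flow-controlled admission for LLM serving (hypothetical sketch).

    New prompts wait in a queue and join the active (decoding) set only
    while projected KV-cache utilization stays below a threshold.
    """

    def __init__(self, kv_capacity_tokens, admit_threshold=0.9,
                 expected_decode_len=256):
        self.kv_capacity = kv_capacity_tokens
        self.threshold = admit_threshold
        # True decode length is unknown; reserve an assumed per-request budget.
        self.expected_decode = expected_decode_len
        self.waiting = collections.deque()
        self.active = {}  # request_id -> tokens currently held in KV cache

    def submit(self, request_id, prompt_len):
        self.waiting.append((request_id, prompt_len))

    def _projected_utilization(self, extra_tokens):
        used = sum(self.active.values())
        return (used + extra_tokens) / self.kv_capacity

    def admit(self):
        """Flow control: admit queued prompts while projected utilization
        (current usage + prompt + expected decode budget) stays under the
        threshold; otherwise stop until memory frees up."""
        admitted = []
        while self.waiting:
            rid, plen = self.waiting[0]
            if self._projected_utilization(plen + self.expected_decode) > self.threshold:
                break  # throttle the rate at which prompts join the active set
            self.waiting.popleft()
            self.active[rid] = plen
            admitted.append(rid)
        return admitted

    def step(self, finished):
        """One decode step: each active request appends one token to its
        KV cache; finished requests release their memory."""
        for rid in list(self.active):
            if rid in finished:
                del self.active[rid]
            else:
                self.active[rid] += 1
```

The key design point this sketch captures is that stability hinges on the admission rule, not on the decode loop: when the (unknown) decode lengths are under-budgeted, utilization can still spike, which is why the paper derives conditions under which admission control provably keeps the system stable.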