Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

arXiv cs.LG / 4/2/2026


Key Points

  • The paper argues that LLM inference output length is inherently uncertain due to stochastic decoding and EOS sampling, so schedulers should not rely on single-point output-length estimates.
  • It finds that output lengths follow a heavy-tailed distribution and can be modeled with a log-t distribution based on empirical analysis.
  • It introduces Tail Inflated Expectation (TIE), a risk-aware metric that replaces point estimates in shortest-job-first (SJF)-style scheduling.
  • Experiments show TIE improves inference performance, cutting per-token latency by 2.31× in online inference and increasing offline data-generation throughput by 1.42× versus strong baselines.
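The heavy-tail claim in the key points is easy to illustrate: a log-t distribution (a Student-t on the log scale) produces mostly moderate output lengths plus occasional extreme ones. Below is a minimal, self-contained Python sketch of sampling from such a distribution; the function name and the parameter values in the usage note are illustrative, not taken from the paper.

```python
import math
import random

def sample_log_t(df, mu, sigma, n, rng):
    """Draw n samples from a log-t distribution: exp(mu + sigma * T),
    where T is Student-t with `df` degrees of freedom.  T is built the
    classic way: a standard normal divided by sqrt(chi-square / df)."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        t = z / math.sqrt(chi2 / df)
        out.append(math.exp(mu + sigma * t))
    return out
```

Drawing, say, 1,000 samples with `df=3`, `mu=math.log(100)`, `sigma=0.8` typically yields a median near 100 tokens but a maximum many times larger, which is exactly the regime where a single point estimate misleads an SJF scheduler.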

Abstract

To schedule LLM inference, the *shortest job first* (SJF) principle is favorable because it prioritizes requests with short output lengths, avoiding head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a *point estimate* does not match the *stochastic* decoding process of LLM inference, where output length is inherently *uncertain* and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. Through an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling; it adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces per-token latency by 2.31× for online inference and improves throughput by 1.42× for offline data generation.
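The abstract describes TIE as the expectation of the fitted log-t distribution inflated by its tail probabilities, but does not spell out the formula. The following is a minimal Python sketch of that idea over empirical length samples; `tie_score`, `tail_q`, and `alpha` are hypothetical names and knobs chosen for illustration, not the paper's definition.

```python
def tie_score(samples, tail_q=0.9, alpha=1.0):
    """Illustrative tail-inflated expectation: the mean predicted
    length, inflated by the probability mass and magnitude of the
    right tail beyond the `tail_q` empirical quantile."""
    xs = sorted(samples)
    mean = sum(xs) / len(xs)
    cut = xs[min(int(tail_q * len(xs)), len(xs) - 1)]
    tail = [x for x in xs if x > cut]
    if not tail:
        return mean  # no tail mass beyond the cutoff
    tail_prob = len(tail) / len(xs)
    tail_mean = sum(tail) / len(tail)
    return mean + alpha * tail_prob * tail_mean

def sjf_order(requests):
    """SJF-style scheduling: serve requests in ascending TIE order."""
    return sorted(requests, key=lambda r: tie_score(r["length_samples"]))
```

A request whose length distribution carries a fat right tail gets a higher score than its mean alone would suggest, so the scheduler demotes it, which is the risk-aware behavior the abstract motivates.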