AI Navigate

Duration-Aware Scheduling for ASR Serving Under Workload Drift

arXiv cs.LG · March 13, 2026


Key Points

  • The paper identifies that FCFS scheduling in ASR serving leads to head-of-line blocking when request durations vary, harming end-to-end latency under workload drift.
  • It demonstrates that audio duration serves as an accurate proxy for processing time in ASR models such as Whisper, enabling duration-aware scheduling.
  • It implements Shortest Job First (SJF) and Highest Response Ratio Next (HRRN) within the vLLM serving framework and evaluates them on LibriSpeech test-clean, showing substantial latency gains and trade-offs.
  • SJF reduces median E2E latency by up to 73% at high load but can cause long-request starvation, increasing the 90th percentile tail latency by up to 97%.
  • HRRN mitigates starvation, achieving up to 28% median latency reduction while bounding tail-latency degradation to at most 24%, with gains persisting under workload drift and minimal overhead (<0.1 ms per request).
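The two policies above differ only in how they rank queued requests. A minimal sketch of the selection rules, assuming audio duration is used as the service-time estimate (field names like `audio_seconds` are illustrative, not taken from the paper's vLLM integration):

```python
def sjf_pick(queue):
    """Shortest Job First: serve the request with the shortest
    estimated service time (audio duration as the proxy)."""
    best = min(queue, key=lambda r: r["audio_seconds"])
    queue.remove(best)
    return best

def hrrn_pick(queue, now):
    """Highest Response Ratio Next: rank by
    (wait + estimated service) / estimated service.
    A long request's ratio grows while it waits, so it cannot
    be starved indefinitely by a stream of short requests."""
    def ratio(r):
        wait = now - r["arrival"]
        return (wait + r["audio_seconds"]) / r["audio_seconds"]
    best = max(queue, key=ratio)
    queue.remove(best)
    return best
```

Note how HRRN initially favors short requests (their denominator is small) but flips to a long request once it has waited long enough, which is exactly the starvation bound the paper relies on.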

Abstract

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet, widely used serving engines rely on first-come-first-served (FCFS) scheduling, which ignores variability in request duration and leads to head-of-line blocking under workload drift. We show that audio duration is an accurate proxy for job processing time in ASR models such as Whisper, and use this insight to enable duration-aware scheduling. We integrate two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into vLLM and evaluate them under realistic and drifted workloads. On LibriSpeech test-clean, compared to the baseline, SJF reduces median E2E latency by up to 73% at high load, but increases 90th-percentile tail latency by up to 97% due to starvation of long requests. HRRN addresses this trade-off: it reduces median E2E latency by up to 28% while bounding tail-latency degradation to at most 24%. These gains persist under workload drift, with no throughput penalty and <0.1 ms scheduling overhead per request.
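The head-of-line blocking the abstract describes can be seen in a toy single-server queue (an illustrative simulation, not the paper's experimental setup): one long request arriving first forces every short request behind it to wait under FCFS, inflating the median latency that SJF avoids.

```python
def simulate(durations, policy):
    """Serve requests on one server; all arrive at t=0.
    durations[i] is the processing time of request i.
    Returns the E2E latency of each request, in arrival order."""
    order = list(range(len(durations)))
    if policy == "sjf":
        # Shortest Job First: reorder the queue by processing time.
        order.sort(key=lambda i: durations[i])
    t = 0.0
    latency = {}
    for i in order:
        t += durations[i]      # request i finishes at time t
        latency[i] = t
    return [latency[i] for i in range(len(durations))]
```

With durations `[30, 1, 1, 1]`, FCFS yields latencies `[30, 31, 32, 33]` (every short request blocked behind the long one), while SJF yields `[33, 1, 2, 3]`: the median drops sharply, but the long request's latency grows, mirroring the median/tail trade-off reported in the paper.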