SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

arXiv cs.AI / 5/4/2026


Key Points

  • The paper argues that GPU schedulers that treat each LLM call independently discard large intermediate state (e.g., the KV cache) between steps, causing 3–8× higher end-to-end latency for multi-step AI agent tasks.
  • It proposes workflow-level (program-level) scheduling, treating an entire agent workflow as the first-class unit rather than individual inference calls.
  • The proposed SAGA system uses Agent Execution Graphs to predict KV cache reuse, session-affinity batching with work stealing to co-locate related requests, and an “Agent Fair Share” metric to enforce fairness with bounded deviation.
  • On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64× (geometric mean) over vLLM v0.15.1 while improving GPU memory utilization by 1.22× and achieving 99.2% SLO attainment under multi-tenant contention.
  • The latency gains come at the cost of roughly 30% lower peak throughput than throughput-optimal batching, positioning SAGA as a better fit for latency-sensitive interactive deployments.
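
To make the second mechanism concrete, here is a minimal sketch of session-affinity routing combined with work stealing. This is an illustration of the general technique, not SAGA's actual implementation; the class and method names (`AffinityScheduler`, `submit`, `steal`) are invented for this example.

```python
import hashlib
from collections import deque


class AffinityScheduler:
    """Toy sketch of session-affinity batching with work stealing.

    All identifiers here are illustrative assumptions, not taken
    from the SAGA paper.
    """

    def __init__(self, num_gpus):
        self.queues = [deque() for _ in range(num_gpus)]

    def submit(self, session_id, request):
        # Affinity: hash the session ID so every LLM call from the
        # same agent workflow lands on the same GPU, where its KV
        # cache from earlier steps is likely still resident.
        gpu = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(self.queues)
        self.queues[gpu].append((session_id, request))
        return gpu

    def steal(self, idle_gpu):
        # Work stealing: an idle GPU takes one request from the back
        # of the longest queue, trading some cache affinity for
        # global load balance.
        victim = max(range(len(self.queues)), key=lambda g: len(self.queues[g]))
        if victim != idle_gpu and self.queues[victim]:
            self.queues[idle_gpu].append(self.queues[victim].pop())
```

The key property is that repeated calls from one session deterministically route to the same GPU, while the steal path bounds how imbalanced the queues can get.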

Abstract

AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3–8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
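
The abstract's first mechanism, using workflow structure to predict KV cache reuse, can be sketched as a simple eviction policy over a step graph. This is a hedged illustration only: the names (`AgentExecutionGraph`, `will_reuse`, `eviction_priority`) and the keep-if-descendants-pending rule are assumptions made for this example, and the paper's actual policy (which approaches Bélády's offline optimum) is certainly more sophisticated.

```python
class AgentExecutionGraph:
    """Illustrative sketch of predicting KV cache reuse from a
    workflow graph. Identifiers and policy are assumptions, not
    the SAGA paper's actual algorithm."""

    def __init__(self):
        self.children = {}  # step name -> list of successor steps

    def add_edge(self, parent, child):
        self.children.setdefault(parent, []).append(child)
        self.children.setdefault(child, [])

    def will_reuse(self, step):
        # Heuristic: a completed step's KV cache is worth keeping
        # iff some pending downstream step will extend its prefix
        # (e.g., a tool call whose result is appended to the prompt).
        return len(self.children.get(step, [])) > 0


def eviction_priority(graph, cached_steps):
    # Order caches for eviction: entries with no predicted reuse
    # come first (False sorts before True).
    return sorted(cached_steps, key=lambda s: graph.will_reuse(s))
```

The point of the sketch is the abstraction shift the paper argues for: a request-level scheduler sees only isolated calls and must evict blindly (e.g., LRU), while a workflow-level scheduler can read the graph and know which caches will be extended by the next tool-call step.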