SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

arXiv cs.AI / 5/4/2026


Key Points

  • The paper argues that GPU schedulers that treat each LLM call independently discard large intermediate state (e.g., the KV cache) between steps, causing 3–8× higher end-to-end latency for multi-step AI agent tasks.
  • It proposes workflow-level (program-level) scheduling, treating an entire agent workflow as the first-class unit rather than individual inference calls.
  • The proposed SAGA system uses Agent Execution Graphs to predict KV cache reuse, session-affinity batching with work stealing to co-locate related requests, and an “Agent Fair Share” metric to enforce fairness with bounded deviation.
  • On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64× (geometric mean) over vLLM v0.15.1 while improving GPU memory utilization by 1.22× and achieving 99.2% SLO attainment under multi-tenant contention.
  • The latency gains come at the cost of roughly 30% lower peak throughput than throughput-optimal batching, positioning SAGA as a better fit for latency-sensitive interactive deployments.
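
To make the second mechanism concrete, here is a minimal sketch of session-affinity routing combined with work stealing. This is an illustration of the general technique, not SAGA's actual implementation; the class and method names (`AffinityScheduler`, `submit`, `steal`) are invented for this example.

```python
import hashlib
from collections import deque


class AffinityScheduler:
    """Toy sketch of session-affinity batching with work stealing.

    All identifiers here are illustrative assumptions, not taken
    from the SAGA paper.
    """

    def __init__(self, num_gpus):
        self.queues = [deque() for _ in range(num_gpus)]

    def submit(self, session_id, request):
        # Affinity: hash the session ID so every LLM call from the
        # same agent workflow lands on the same GPU, where its KV
        # cache from earlier steps is likely still resident.
        gpu = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(self.queues)
        self.queues[gpu].append((session_id, request))
        return gpu

    def steal(self, idle_gpu):
        # Work stealing: an idle GPU takes one request from the back
        # of the longest queue, trading some cache affinity for
        # global load balance.
        victim = max(range(len(self.queues)), key=lambda g: len(self.queues[g]))
        if victim != idle_gpu and self.queues[victim]:
            self.queues[idle_gpu].append(self.queues[victim].pop())
```

The key property is that repeated calls from one session deterministically route to the same GPU, while the steal path bounds how imbalanced the queues can get.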

Abstract

AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3–8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
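
The abstract's first mechanism, using workflow structure to predict KV cache reuse, can be sketched as a simple eviction policy over a step graph. This is a hedged illustration only: the names (`AgentExecutionGraph`, `will_reuse`, `eviction_priority`) and the keep-if-descendants-pending rule are assumptions made for this example, and the paper's actual policy (which approaches Bélády's offline optimum) is certainly more sophisticated.

```python
class AgentExecutionGraph:
    """Illustrative sketch of predicting KV cache reuse from a
    workflow graph. Identifiers and policy are assumptions, not
    the SAGA paper's actual algorithm."""

    def __init__(self):
        self.children = {}  # step name -> list of successor steps

    def add_edge(self, parent, child):
        self.children.setdefault(parent, []).append(child)
        self.children.setdefault(child, [])

    def will_reuse(self, step):
        # Heuristic: a completed step's KV cache is worth keeping
        # iff some pending downstream step will extend its prefix
        # (e.g., a tool call whose result is appended to the prompt).
        return len(self.children.get(step, [])) > 0


def eviction_priority(graph, cached_steps):
    # Order caches for eviction: entries with no predicted reuse
    # come first (False sorts before True).
    return sorted(cached_steps, key=lambda s: graph.will_reuse(s))
```

The point of the sketch is the abstraction shift the paper argues for: a request-level scheduler sees only isolated calls and must evict blindly (e.g., LRU), while a workflow-level scheduler can read the graph and know which caches will be extended by the next tool-call step.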