SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
arXiv cs.AI / 5/4/2026
💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that GPU schedulers that treat each LLM call as an independent request discard large intermediate state (e.g., the KV cache) between calls, causing 3–8× higher end-to-end latency for multi-step AI agent tasks.
- It proposes workflow-level (program-level) scheduling, treating an entire agent workflow as the first-class unit rather than individual inference calls.
- The proposed SAGA system uses Agent Execution Graphs to predict KV cache reuse, session-affinity batching with work stealing to co-locate related requests, and an “Agent Fair Share” metric to enforce fairness with bounded deviation (see the sketches after this list).
- On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, SAGA improves task completion time by 1.64× (geometric mean) over vLLM v0.15.1 while boosting GPU memory utilization by 1.22× and achieving 99.2% SLO attainment under multi-tenant contention.
- The latency gains come at the cost of roughly 30% lower peak throughput than throughput-optimal batching, positioning SAGA as a better fit for latency-sensitive interactive deployments.
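
To make the co-location idea concrete, here is a minimal Python sketch of session-affinity placement with work stealing. All names (`Step`, `GpuWorker`, `place`, `steal`) are illustrative assumptions, not SAGA's published API: steps route to the worker already holding their workflow's KV cache, and idle workers steal only steps whose migration loses no cached prefix.

```python
# Hypothetical sketch; Step, GpuWorker, place, and steal are illustrative
# names, not SAGA's actual API.
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Step:
    """One LLM call inside an agent workflow."""
    workflow_id: str
    prompt_tokens: int
    reuses_prefix: bool  # predicted from the Agent Execution Graph:
                         # can this step reuse the previous step's KV cache?

@dataclass
class GpuWorker:
    """A GPU worker that keeps per-workflow KV cache resident."""
    name: str
    queue: deque = field(default_factory=deque)
    resident_workflows: set = field(default_factory=set)

    def load(self) -> int:
        return len(self.queue)

def place(step: Step, workers: list[GpuWorker]) -> GpuWorker:
    """Session-affinity placement: route a step to the worker that already
    holds its workflow's KV cache; otherwise pick the least-loaded worker."""
    for w in workers:
        if step.workflow_id in w.resident_workflows:
            return w
    return min(workers, key=GpuWorker.load)

def steal(idle: GpuWorker, workers: list[GpuWorker]) -> None:
    """Work stealing: an idle worker takes one step from the most loaded peer,
    preferring steps that do not reuse a cached prefix (cheap to migrate)."""
    victim = max((w for w in workers if w is not idle), key=GpuWorker.load)
    for i, step in enumerate(victim.queue):
        if not step.reuses_prefix:  # migrating this step discards no KV cache
            del victim.queue[i]
            idle.queue.append(step)
            idle.resident_workflows.add(step.workflow_id)
            return

workers = [GpuWorker("gpu0"), GpuWorker("gpu1")]
for step in [Step("wf-A", 4096, False), Step("wf-A", 512, True),
             Step("wf-B", 2048, False)]:
    w = place(step, workers)
    w.queue.append(step)
    w.resident_workflows.add(step.workflow_id)
# Both wf-A steps land on gpu0, so the second reuses the cached prefix;
# wf-B goes to the less-loaded gpu1.
```

The key difference from per-request scheduling is that placement consults workflow identity, so consecutive steps of one agent land where their KV cache already lives.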
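The fairness mechanism can be sketched the same way. The paper's exact "Agent Fair Share" formula is not quoted in this digest, so the max-min-style accounting below is an assumption: tenants are charged GPU time at workflow granularity, the most under-served tenant is scheduled next, and the spread between tenants is what a bounded-deviation guarantee would cap.

```python
# Hypothetical sketch of workflow-level fair-share accounting; the paper's
# exact "Agent Fair Share" formula is not reproduced here, so this max-min
# style accounting is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    weight: float        # entitled share of cluster GPU time
    served: float = 0.0  # GPU-seconds charged for completed workflow steps

def normalized_service(t: Tenant) -> float:
    """Service received per unit of entitlement; lower means under-served."""
    return t.served / t.weight

def pick_next(tenants: list[Tenant]) -> Tenant:
    """Serve the most under-served tenant. Charging whole workflows rather
    than single requests is what makes the share agent-level."""
    return min(tenants, key=normalized_service)

def fairness_deviation(tenants: list[Tenant]) -> float:
    """Spread between the most- and least-served tenants; capping this value
    yields fairness with bounded deviation."""
    vals = [normalized_service(t) for t in tenants]
    return max(vals) - min(vals)

tenants = [Tenant("team-a", weight=2.0), Tenant("team-b", weight=1.0)]
pick_next(tenants).served += 5.0              # team-a runs a 5 GPU-second step
print(pick_next(tenants).name)                # -> team-b, now most under-served
print(round(fairness_deviation(tenants), 2))  # -> 2.5
```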