MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

arXiv cs.LG / 5/1/2026


Key Points

  • The paper argues that deploying LLMs as autonomous agent “execution cores” changes workloads from single-turn GPU inference to multi-turn LLM–tool loops spanning both GPUs and CPUs.
  • It introduces MARS, an adaptive co-scheduling system that coordinates heterogeneous agentic workloads under coupled GPU–CPU resource pressure using unified visibility and a control plane that separates admission from execution.
  • MARS uses an internal, agent-centric scheduler to shorten the end-to-end critical path, prioritizing latency-sensitive continuations and retaining KV-cache state across tool calls only when warm resumption yields a latency benefit.
  • Experiments report up to 5.94× lower end-to-end latency while sustaining nearly maximal throughput, and integrating MARS into the OpenHands coding agent accelerates task completion by up to 1.87×.
  • The authors state that the MARS source code will be made publicly available soon.
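The control-plane idea in the second bullet, decoupling admission from execution so that neither GPU nor CPU capacity is oversubscribed, can be illustrated with a minimal sketch. All class and field names here are hypothetical, not from the MARS paper; the point is only the coupled two-resource admission check:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class AgentRequest:
    # Hypothetical request descriptor: GPU memory for LLM inference,
    # CPU cores for the agent's tool execution.
    req_id: str
    gpu_mem_gb: float
    cpu_cores: int

class AdmissionController:
    """Toy external control plane: a request enters execution only if
    BOTH the GPU and CPU budgets can hold it; otherwise it waits in an
    admission queue, preventing heterogeneous oversubscription."""

    def __init__(self, gpu_budget_gb: float, cpu_budget_cores: int):
        self.gpu_free = gpu_budget_gb
        self.cpu_free = cpu_budget_cores
        self.waiting: deque = deque()
        self.running: dict = {}

    def _fits(self, req: AgentRequest) -> bool:
        return req.gpu_mem_gb <= self.gpu_free and req.cpu_cores <= self.cpu_free

    def _admit(self, req: AgentRequest) -> None:
        self.gpu_free -= req.gpu_mem_gb
        self.cpu_free -= req.cpu_cores
        self.running[req.req_id] = req

    def submit(self, req: AgentRequest) -> bool:
        # Admission is separate from execution: queue when budgets are full.
        if self._fits(req):
            self._admit(req)
            return True
        self.waiting.append(req)
        return False

    def complete(self, req_id: str) -> None:
        # Release both resources, then admit waiting requests in FIFO order.
        req = self.running.pop(req_id)
        self.gpu_free += req.gpu_mem_gb
        self.cpu_free += req.cpu_cores
        while self.waiting and self._fits(self.waiting[0]):
            self._admit(self.waiting.popleft())
```

The actual system additionally feeds the controller a unified information stream spanning GPU inference and CPU tool execution; this sketch only shows the admission gate itself.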

Abstract

Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal agent-centric scheduler further minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by accelerating end-to-end task completion time by up to 1.87x. Our source code will be publicly available soon.
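The abstract's adaptive KV-cache policy, retaining cache state only when warm resumption yields a latency benefit, amounts to a cost comparison. A minimal illustrative version (the parameters and cost model are assumptions for exposition, not the paper's actual formulation) keeps the cache across a tool call only when the prefill time it would save on resumption exceeds the cost of holding that GPU memory while the tool runs on the CPU:

```python
def should_retain_kv(prefill_ms: float,
                     tool_duration_ms: float,
                     kv_bytes: float,
                     mem_cost_per_byte_ms: float) -> bool:
    """Illustrative retention rule: keep the KV cache across a tool call
    only when the warm-resumption saving (the skipped cold prefill)
    outweighs the opportunity cost of pinning that GPU memory for the
    tool's expected duration."""
    saving = prefill_ms                                  # latency avoided on resume
    holding_cost = kv_bytes * mem_cost_per_byte_ms * tool_duration_ms
    return saving > holding_cost
```

For a short tool call on a long context, the saving dominates and the cache is kept warm; for a long-running tool holding a large cache, eviction and later recomputation wins. The scheduler makes this choice per continuation rather than applying a fixed retain-or-evict policy.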