In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks

arXiv cs.AI / 5/1/2026


Key Points

  • Agent orchestration frameworks (e.g., LangGraph, CrewAI, OpenAI Agents SDK) place an external controller above the LLM that tracks state and injects routing instructions at each turn.
  • The paper argues that for procedural, step-by-step tasks, a simpler design that encodes the full procedure in the system prompt and lets the model self-orchestrate can outperform external orchestration; both designs are sketched in code after this list.
  • In controlled tests across three procedural domains (travel booking, Zoom technical support, and insurance claims), with 200 conversations per condition, the in-context approach achieved higher quality scores than a LangGraph orchestrator using the same model (4.53–5.00 vs. 4.17–4.84 on a 5-point scale).
  • The external orchestrator also failed substantially more often in all three domains: 24% vs. 11.5% of conversations for travel, 9% vs. 0.5% for Zoom, and 17% vs. 5% for insurance.
  • The authors conclude that while external orchestration may have been needed for earlier model generations, frontier model improvements reduce the need for it in multi-turn conversations that follow a defined procedure.
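
For concreteness, here is a minimal sketch of the two designs in Python. The chat client, node names, and procedure text are illustrative assumptions, not the paper's implementation; the point is only the structural contrast between one system prompt carrying the whole procedure and a controller that tracks a state node and swaps in per-step instructions.

```python
# Minimal sketch of the two designs. call_llm is a hypothetical stand-in
# for any chat-completion client; the procedure text and node names are
# invented for illustration and are not taken from the paper.

PROCEDURE = """\
You are a travel-booking assistant. Follow this procedure step by step:
1. Ask for origin, destination, and travel dates.
2. Offer up to three flight options and wait for a selection.
3. Collect passenger details, then confirm the booking.
"""

def call_llm(messages: list[dict]) -> str:
    """Hypothetical chat-completion call; swap in a real client here."""
    raise NotImplementedError

# In-context self-orchestration: the full procedure lives in one system
# prompt, and the model itself decides which step it is on.
def in_context_turn(history: list[dict], user_msg: str) -> str:
    messages = [{"role": "system", "content": PROCEDURE},
                *history,
                {"role": "user", "content": user_msg}]
    return call_llm(messages)

# External orchestration: a controller outside the model tracks the
# current node and injects a fresh routing instruction on every turn.
NODES = {
    "collect_trip": "Ask only for origin, destination, and travel dates.",
    "offer_flights": "Offer up to three flight options; wait for a choice.",
    "confirm": "Collect passenger details and confirm the booking.",
}
TRANSITIONS = {"collect_trip": "offer_flights", "offer_flights": "confirm"}

def orchestrated_turn(state: dict, history: list[dict], user_msg: str) -> str:
    instruction = NODES[state["node"]]
    messages = [{"role": "system", "content": instruction},
                *history,
                {"role": "user", "content": user_msg}]
    reply = call_llm(messages)
    # Advance the state machine; in real frameworks this routing logic is
    # what the external orchestrator maintains between turns.
    state["node"] = TRANSITIONS.get(state["node"], state["node"])
    return reply
```

In the orchestrated variant, every routing decision lives outside the model and must be re-injected each turn; in the in-context variant, the model reads the procedure once and handles the routing itself.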

Abstract

Agent orchestration frameworks -- LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, and others -- place an external orchestrator above the LLM, tracking state and injecting routing instructions at every turn. We present a controlled comparison showing that for procedural tasks, this architecture is dominated by a simpler alternative: putting the entire procedure in the system prompt and letting the model self-orchestrate. Across three domains -- travel booking (14 nodes), Zoom technical support (14 nodes), and insurance claims processing (55 nodes) -- we evaluate 200 conversations per condition using LLM-as-judge scoring on five quality criteria. The in-context approach scores 4.53--5.00 on a 5-point scale while a LangGraph orchestrator using the same model scores 4.17--4.84. The orchestrated system fails on 24% of travel, 9% of Zoom, and 17% of insurance conversations, compared to 11.5%, 0.5%, and 5% for the in-context baseline. While external orchestration may have been necessary for earlier models, advances in frontier model capabilities have made it unnecessary for multi-turn conversations following a defined procedure.
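
The evaluation protocol in the abstract (200 conversations per condition, five criteria, a 5-point scale) corresponds to a standard LLM-as-judge loop. A minimal sketch follows, assuming hypothetical criterion names and a generic completion client; the paper's actual rubric and judge prompt are not reproduced in this summary.

```python
import json

# Minimal sketch of the LLM-as-judge scoring loop (five criteria, 5-point
# scale). The criterion names below are assumptions for illustration.
CRITERIA = ["procedure_adherence", "correctness", "helpfulness",
            "coherence", "efficiency"]

JUDGE_PROMPT = """\
Rate the conversation transcript below on each criterion from 1 to 5.
Criteria: {criteria}

Transcript:
{transcript}

Return a JSON object mapping each criterion to an integer score.
"""

def call_llm(prompt: str) -> str:
    """Hypothetical completion call; swap in a real judge model here."""
    raise NotImplementedError

def judge_conversation(transcript: str) -> dict[str, int]:
    raw = call_llm(JUDGE_PROMPT.format(criteria=", ".join(CRITERIA),
                                       transcript=transcript))
    scores = json.loads(raw)
    return {c: int(scores[c]) for c in CRITERIA}

def mean_quality(transcripts: list[str]) -> float:
    # Average over criteria, then over all conversations in a condition;
    # this is the kind of aggregate behind the reported 4.17-5.00 range.
    per_conv = [sum(judge_conversation(t).values()) / len(CRITERIA)
                for t in transcripts]
    return sum(per_conv) / len(per_conv)
```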