Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations

arXiv cs.AI / 3/30/2026


Key Points

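  • The paper tests whether an expensive “manager” LLM can direct a cheaper “worker” LLM to solve software engineering tasks, using a two-agent ManagerWorker pipeline in which a text-only manager analyzes issues, dispatches exploration tasks, and reviews implementations while a worker with full repo access executes the code changes.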
  • Across 200 SWE-bench Lite instances, a strong manager guiding a weak worker reaches 62% accuracy, comparable to a strong single model at 60% accuracy while using far fewer “strong-model” tokens.
  • A weak manager directing a weak worker underperforms the weak baseline (42% vs. 44%), indicating that the manager-worker setup only helps when there is a real capability gap and effective direction.
  • The authors find that the manager's value comes from active direction rather than review alone: a minimal review-only loop adds just 2 percentage points over the baseline, while structured exploration and planning add about 11 points.
  • The results suggest a training limitation: current models are largely trained as monolithic agents, so splitting roles into director/worker fights the training distribution; the proposed fix is to keep each agent near its trained mode and externalize organizational structure in code.

Abstract

Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We study this question by introducing ManagerWorker, a two-agent pipeline where an expensive "manager" model (text-only, no code execution) analyzes issues, dispatches exploration tasks, and reviews implementations, while a cheap "worker" model (with full repo access) executes code changes. We evaluate on 200 instances from SWE-bench Lite across five configurations that vary the manager-worker relationship, pipeline complexity, and model pairing. Our findings reveal both the promise and the limits of multi-agent direction: (1) a strong manager directing a weak worker (62%) matches a strong single agent (60%) at a fraction of the strong-model token usage, showing that expensive reasoning can substitute for expensive execution; (2) a weak manager directing a weak worker (42%) performs worse than the weak agent alone (44%), demonstrating that the directing relationship requires a genuine capability gap--structure without substance is pure overhead; (3) the manager's value lies in directing, not merely reviewing--a minimal review-only loop adds just 2pp over the baseline, while structured exploration and planning add 11pp, showing that active direction is what makes the capability gap productive; and (4) these behaviors trace to a single root cause: current models are trained as monolithic agents, and splitting them into director/worker roles fights their training distribution. The pipeline succeeds by designing around this mismatch--keeping each model close to its trained mode (text generation for the manager, tool use for the worker) and externalizing organizational structure to code. This diagnosis points to concrete training gaps: delegation, scoped execution, and mode switching are skills absent from current training data.
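
The abstract's idea of "externalizing organizational structure to code" can be illustrated with a minimal sketch. This is not the authors' implementation: the function `manager_worker`, the stage labels (`ANALYZE`, `EXPLORE`, `PLAN`, `IMPLEMENT`, `REVIEW`, `REVISE`), the `APPROVE` convention, and the stub models are all hypothetical names chosen for illustration. The sketch assumes each model is simply a function from prompt text to reply text, so the manager stays in text generation, the worker receives one scoped task at a time, and plain Python decides who is called next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a prompt string to a reply string.
Model = Callable[[str], str]


@dataclass
class RunLog:
    manager_calls: int = 0
    worker_calls: int = 0
    transcript: List[str] = field(default_factory=list)


def manager_worker(issue: str, manager: Model, worker: Model,
                   max_review_rounds: int = 2) -> Tuple[str, RunLog]:
    """Hypothetical orchestration loop: the control flow lives here, not in a prompt."""
    log = RunLog()

    def ask_manager(prompt: str) -> str:
        log.manager_calls += 1
        reply = manager(prompt)
        log.transcript.append(f"manager: {reply}")
        return reply

    def ask_worker(task: str) -> str:
        log.worker_calls += 1
        reply = worker(task)
        log.transcript.append(f"worker: {reply}")
        return reply

    # 1. The manager analyzes the issue and lists exploration tasks, one per line.
    analysis = ask_manager(f"ANALYZE\n{issue}")
    tasks = [line.strip() for line in analysis.splitlines() if line.strip()]

    # 2. The code (not the manager) dispatches each scoped task to the worker.
    findings = [ask_worker(f"EXPLORE\n{task}") for task in tasks]

    # 3. The manager turns the findings into a plan.
    plan = ask_manager("PLAN\n" + "\n".join(findings))

    # 4. The worker implements the plan.
    patch = ask_worker(f"IMPLEMENT\n{plan}")

    # 5. The manager reviews; the worker revises until approval or the round limit.
    for _ in range(max_review_rounds):
        verdict = ask_manager(f"REVIEW\n{patch}")
        if verdict.strip().upper().startswith("APPROVE"):
            break
        patch = ask_worker(f"REVISE\n{verdict}\n{patch}")

    return patch, log


# Stub models standing in for real LLM calls (purely illustrative).
def stub_manager(prompt: str) -> str:
    stage = prompt.split("\n", 1)[0]
    if stage == "ANALYZE":
        return "find the failing test\nlocate the parser module"
    if stage == "PLAN":
        return "patch the parser"
    return "APPROVE"


def stub_worker(task: str) -> str:
    return "done: " + task.split("\n", 1)[0]


patch, log = manager_worker("Parser crashes on empty input", stub_manager, stub_worker)
```

Counting `log.manager_calls` and `log.worker_calls` separately is one simple way to track how much of a run is spent on the expensive model, which is the quantity the abstract's "fraction of the strong-model token usage" claim is about; a real setup would count tokens rather than calls.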