Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning

arXiv cs.AI / 5/5/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces an LM-based multi-agent framework that separates long-horizon automation into three roles: a planner (high-level decisions), an actor (task execution), and a memory manager (contextual reasoning).
  • A key finding from the authors’ compute-allocation analysis is that planning dominates overall task performance, while competitive execution and memory management can be achieved with substantially less compute and model capacity.
  • The authors propose planner-centric reinforcement learning that optimizes only the planner using trajectory-level rewards from a VLM-as-judge, while freezing the actor and memory components.
  • Experiments across benchmarks for web navigation, OS control, and tool use show that focusing capacity and learning on high-level planning improves robustness and compute efficiency in long-horizon agent automation.
  • The research includes a publicly released codebase to support replication and further experimentation.
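To make the three-role decomposition concrete, here is a minimal, hypothetical sketch of the control loop the key points describe: a planner proposes high-level subgoals, an actor grounds each subgoal into concrete actions, and a memory manager maintains compact context fed back to the planner. All class and method names (`Planner.plan`, `Actor.act`, `MemoryManager.update`, etc.) are illustrative assumptions, not the authors' actual API.

```python
# Toy sketch of the planner / actor / memory-manager loop described in the
# paper's key points. Names and logic are illustrative assumptions.

class Planner:
    """High-capacity model: decides the next high-level subgoal."""
    def plan(self, instruction, memory):
        # Toy policy: pick the first subgoal not yet marked done in memory.
        for subgoal in instruction:
            if subgoal not in memory:
                return subgoal
        return None  # all subgoals satisfied

class Actor:
    """Lightweight model: executes one subgoal as concrete actions."""
    def act(self, subgoal):
        return f"executed:{subgoal}"

class MemoryManager:
    """Lightweight model: compresses the trajectory into reusable context."""
    def __init__(self):
        self.summary = set()
    def update(self, subgoal, observation):
        if observation.startswith("executed:"):
            self.summary.add(subgoal)
    def read(self):
        return self.summary

def run_episode(instruction, max_steps=10):
    planner, actor, memory = Planner(), Actor(), MemoryManager()
    trajectory = []
    for _ in range(max_steps):
        subgoal = planner.plan(instruction, memory.read())
        if subgoal is None:
            break
        obs = actor.act(subgoal)
        memory.update(subgoal, obs)
        trajectory.append((subgoal, obs))
    return trajectory

traj = run_episode(["open_browser", "search_flights", "book_ticket"])
```

Note how only the planner sees the full instruction and memory state; the actor and memory manager handle narrow, local work, which is why (per the authors' analysis) they can be served by much smaller models.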

Abstract

Language model (LM)-based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long-horizon planning and reasoning. To address this, we propose an enhanced multi-agent framework that decomposes automation into three roles: a planner for high-level decision-making, an actor for task execution, and a memory manager for contextual reasoning. While this modular decomposition aligns with established design patterns, our core contribution lies in a systematic compute-allocation analysis, revealing that planning is the dominant factor influencing task performance. Execution and memory management require significantly less compute and model capacity to achieve competitive results. Building on these insights, we introduce a planner-centric reinforcement learning approach, which exclusively optimizes the planner using trajectory-level rewards from a VLM-as-judge, while freezing the other components. Extensive experiments on benchmarks spanning web navigation, OS control, and tool use demonstrate that concentrating model capacity and learning on high-level planning yields robust and compute-efficient improvements in long-horizon agent automation. Our code is publicly released.
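The planner-centric RL idea in the abstract can be sketched in miniature: only the planner's parameters receive gradient updates from a trajectory-level reward (here a stand-in for the VLM-as-judge score), while the actor stays frozen. The sketch below uses a toy REINFORCE update on a two-option softmax planner under a made-up environment; all numbers, names, and success probabilities are illustrative assumptions, not the paper's actual training recipe.

```python
# Minimal sketch of planner-centric RL: trajectory-level reward (judge) drives
# a REINFORCE update on planner logits only; the actor is frozen. Illustrative
# assumptions throughout.

import math
import random

random.seed(0)

PLANS = ["direct", "decomposed"]
theta = [0.0, 0.0]  # planner logits: the ONLY trainable parameters

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def frozen_actor(plan):
    # Frozen executor: "decomposed" plans succeed more often in this toy world.
    p_success = 0.9 if plan == "decomposed" else 0.3
    return random.random() < p_success

def judge(success):
    # Stand-in for a VLM-as-judge scoring the whole trajectory.
    return 1.0 if success else 0.0

def reinforce_step(lr=0.5):
    probs = softmax(theta)
    i = random.choices(range(len(PLANS)), weights=probs)[0]
    reward = judge(frozen_actor(PLANS[i]))
    # REINFORCE gradient for a categorical policy: (1[k==i] - p_k) * reward
    for k in range(len(theta)):
        grad = ((1.0 if k == i else 0.0) - probs[k]) * reward
        theta[k] += lr * grad

for _ in range(200):
    reinforce_step()

final_probs = softmax(theta)
```

After training, the planner's probability mass shifts toward the plan style the frozen actor executes well, mirroring the paper's argument that concentrating learning in the planner is sufficient to improve end-to-end task success.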