GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
arXiv cs.AI / 4/20/2026
💬 Opinion · Signals & Early Trends · Models & Research
Key Points
- The paper argues that existing tool-use agent benchmarks don’t reflect real productivity workflows, since they often use AI-generated queries, dummy tools, and weak system-level coordination.
- It introduces GTA-2, a hierarchical benchmark for General Tool Agents that covers both atomic tool use (GTA-Atomic) and long-horizon open-ended workflows (GTA-Workflow) using authentic user queries, deployed tools, and multimodal contexts.
- For evaluating open-ended deliverables, the authors propose a recursive checkpoint-based mechanism that breaks tasks into verifiable sub-goals to enable unified assessment of both model abilities and execution harnesses.
- Experiments reveal a major capability gap: frontier models score under 50% on atomic tasks and achieve only a 14.39% success rate on workflows, with workflow completion depending heavily on the quality of the execution framework.
- The results also suggest that checkpoint-guided feedback improves performance and that advanced execution frameworks like Manus and OpenClaw significantly boost workflow completion; the dataset and code are planned for release.
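The paper's exact recursive checkpoint mechanism isn't detailed in this summary, but the idea of decomposing an open-ended deliverable into verifiable sub-goals can be sketched as a small tree of checkpoints whose leaf verifiers are binary checks and whose internal nodes average their children. Everything below (the `Checkpoint` class, the `score` function, and the example rubric) is a hypothetical illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    # A verifiable sub-goal: leaves carry a binary verifier,
    # internal nodes aggregate their children's scores.
    name: str
    verify: Callable[[Dict[str, bool]], bool] = lambda artifact: True
    children: List["Checkpoint"] = field(default_factory=list)


def score(cp: Checkpoint, artifact: Dict[str, bool]) -> float:
    """Leaf: 1.0/0.0 from its verifier. Internal node: mean of child scores."""
    if not cp.children:
        return 1.0 if cp.verify(artifact) else 0.0
    return sum(score(child, artifact) for child in cp.children) / len(cp.children)


# Hypothetical workflow rubric: "produce a report file containing a labeled chart".
rubric = Checkpoint("report", children=[
    Checkpoint("file_written", verify=lambda a: a["has_file"]),
    Checkpoint("chart", children=[
        Checkpoint("chart_present", verify=lambda a: a["has_chart"]),
        Checkpoint("axes_labeled", verify=lambda a: a["chart_labeled"]),
    ]),
])

deliverable = {"has_file": True, "has_chart": True, "chart_labeled": False}
print(score(rubric, deliverable))  # 0.75: the unlabeled chart costs half the chart subtree
```

A partial-credit score like this is what makes checkpoint-guided feedback possible: the agent (or harness) can see which sub-goal failed rather than receiving a single pass/fail verdict on the whole workflow.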
Related Articles
Which Version of Qwen 3.6 for M5 Pro 24g
Reddit r/LocalLLaMA

From Theory to Reality: Why Most AI Agent Projects Fail (And How Mine Did Too)
Dev.to

GPT-5.4-Cyber: OpenAI's Game-Changer for AI Security and Defensive AI
Dev.to

Local LLM Beginner’s Guide (Mac - Apple Silicon)
Reddit r/artificial

Is Your Skill Actually Good? Systematically Validating Agent Skills with Evals
Dev.to