AutomationBench

arXiv cs.AI / 4/22/2026

Key Points

Existing software automation benchmarks are often too narrow, failing to cover cross-application coordination, autonomous API discovery, and policy compliance in a single evaluation.
AutomationBench, introduced in an arXiv paper, is a benchmark that evaluates AI agents’ ability to orchestrate REST API workflows across multiple business systems.
Tasks draw on real workflow patterns from Zapier's platform, span Sales, Marketing, Operations, Support, Finance, and HR, and place agents in environments that contain irrelevant and sometimes misleading records.
Evaluation is programmatic and end-state based, checking whether the agent wrote the correct data into the correct systems, rather than intermediate reasoning steps.
Current leading AI models perform poorly on AutomationBench, scoring below 10%, highlighting a gap between today’s agentic capabilities and practical business needs.

Abstract

Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform, requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier's platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.
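
To make the idea of end-state-only grading concrete, here is a minimal sketch in Python. It is not AutomationBench's actual harness, which the abstract does not describe; the names `Expectation`, `record_matches`, and `grade_end_state`, the system names, and the record fields are all hypothetical and chosen only for illustration. The sketch assumes each system's final state can be read as a list of records, and that a task passes only if every expected record is present and no forbidden write (for example, an update to a distractor record) occurred.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical representation: system name -> list of records in its final state.
State = dict[str, list[dict[str, Any]]]


@dataclass
class Expectation:
    """A record pattern for one system (hypothetical, for illustration only)."""
    system: str
    fields: dict[str, Any]


def record_matches(record: dict[str, Any], fields: dict[str, Any]) -> bool:
    # A record matches when every listed field has exactly the given value.
    return all(record.get(key) == value for key, value in fields.items())


def grade_end_state(state: State,
                    required: list[Expectation],
                    forbidden: list[Expectation]) -> bool:
    """Pass only if all required records exist and no forbidden record exists.

    Only the final state is inspected; the agent's intermediate calls and
    reasoning are never looked at.
    """
    for exp in required:
        records = state.get(exp.system, [])
        if not any(record_matches(r, exp.fields) for r in records):
            return False
    for exp in forbidden:
        records = state.get(exp.system, [])
        if any(record_matches(r, exp.fields) for r in records):
            return False
    return True


# Toy task: qualify one lead in the CRM and book a call, leaving a
# similar-looking distractor lead untouched.
final_state: State = {
    "crm": [
        {"email": "ana@example.com", "stage": "qualified"},
        {"email": "old@example.com", "stage": "lost"},  # distractor record
    ],
    "calendar": [
        {"attendee": "ana@example.com", "title": "Intro call"},
    ],
}
required = [
    Expectation("crm", {"email": "ana@example.com", "stage": "qualified"}),
    Expectation("calendar", {"attendee": "ana@example.com"}),
]
forbidden = [
    Expectation("crm", {"email": "old@example.com", "stage": "qualified"}),
]

passed = grade_end_state(final_state, required, forbidden)
print("task passed:", passed)
```

Under this kind of all-or-nothing check, an agent that updates the right CRM record but skips the calendar entry, or that also modifies the distractor, fails the whole task, which is one plausible reason scores on such a benchmark could be low.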
