ClawBench: Can AI Agents Complete Everyday Online Tasks?

arXiv cs.CL / 4/10/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

ClawBench is a new evaluation framework on arXiv that tests AI agents on 153 everyday online tasks across 144 live platforms and 15 categories, including purchases, booking, and job applications.
The benchmark is designed to reflect real-world web interaction by operating on production websites rather than offline sandboxes, preserving dynamic content and multi-step workflow complexity.
Tasks explicitly require agent capabilities beyond prior benchmarks, such as extracting information from user-provided documents, navigating diverse multi-step flows, and performing write-heavy form filling accurately.
A lightweight interception layer blocks only final submission requests to enable safe evaluation without causing real-world side effects.
Initial results evaluating 7 frontier models show both proprietary and open-source agents can complete only a small fraction of tasks, with Claude Sonnet 4.6 reaching 33.3%, indicating substantial room for improvement toward reliable general-purpose assistants.

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

CIA is trusting AI to help analyze intel from human spies

Reddit r/artificial

LLM API Pricing in 2026: I Put Every Major Model in One Table

Dev.to

i generated AI video on a GTX 1660. here's what it actually takes.

Dev.to

Meta-Optimized Continual Adaptation for planetary geology survey missions for extreme data sparsity scenarios

Dev.to

How To Optimize Enterprise AI Energy Consumption

Dev.to

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Key Points

Abstract

Related Articles

CIA is trusting AI to help analyze intel from human spies

LLM API Pricing in 2026: I Put Every Major Model in One Table

i generated AI video on a GTX 1660. here's what it actually takes.

Meta-Optimized Continual Adaptation for planetary geology survey missions for extreme data sparsity scenarios

How To Optimize Enterprise AI Energy Consumption

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer