ClawBench: Can AI Agents Complete Everyday Online Tasks?
arXiv cs.CL / 4/10/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- ClawBench is a new evaluation framework on arXiv that tests AI agents on 153 everyday online tasks across 144 live platforms and 15 categories, including purchases, bookings, and job applications.
- The benchmark is designed to reflect real-world web interaction by operating on production websites rather than offline sandboxes, preserving dynamic content and multi-step workflow complexity.
- Tasks explicitly demand capabilities beyond prior benchmarks: extracting information from user-provided documents, navigating diverse multi-step flows, and accurately completing write-heavy forms.
- A lightweight interception layer blocks only final submission requests, enabling safe evaluation on live sites without real-world side effects (see the sketch after this list).
- Initial results on 7 frontier models show that both proprietary and open-source agents struggle: the best performer, Claude Sonnet 4.6, completes only 33.3% of tasks, indicating substantial room for improvement toward reliable general-purpose assistants.
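To make the interception idea concrete, here is a minimal sketch of how such a layer could be built with Playwright's request routing. Everything specific below is an assumption for illustration: the `SUBMIT_PATTERNS` endpoint fragments, the synthetic JSON success response, and the target URL are hypothetical, not taken from the paper.

```python
# Minimal sketch of a submission-blocking interception layer, assuming
# Playwright. The endpoint patterns and synthetic response below are
# illustrative assumptions, not ClawBench's actual implementation.
from playwright.sync_api import sync_playwright, Route

# Hypothetical URL fragments that mark a "final submission" endpoint.
SUBMIT_PATTERNS = ("/checkout", "/submit", "/apply", "/book")

def intercept(route: Route) -> None:
    req = route.request
    # Block only state-changing requests to submission endpoints;
    # everything else passes through so the live site keeps working.
    if req.method in ("POST", "PUT") and any(p in req.url for p in SUBMIT_PATTERNS):
        # Return a synthetic success so the agent observes a completed
        # flow without causing a real-world side effect.
        route.fulfill(status=200,
                      content_type="application/json",
                      body='{"ok": true}')
    else:
        route.continue_()

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    page.route("**/*", intercept)          # install the interception layer
    page.goto("https://example.com")       # placeholder target site
    # ... agent drives the page; final submissions never leave the machine.
    browser.close()
```

Fulfilling the blocked request with a synthetic success, rather than simply aborting it, lets the agent see a completed workflow while the evaluator can inspect the intercepted payload to score the task.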