ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

Reddit r/MachineLearning / 4/15/2026


Key Points

  • ClawBench is introduced as a new benchmark for AI browser agents, testing performance on 153 real-world everyday tasks across 144 live websites.
  • The best reported success rate is only 33.3% (Claude Sonnet 4.6), indicating that even top models struggle to reliably complete everyday online workflows.
  • The benchmark finds notable category differences: finance and academic tasks are easier (about 50% for the best model), while travel and developer-related tasks are substantially harder.
  • ClawBench differs from synthetic tests by using real production sites and capturing 5 layers of behavioral/technical evidence (session replay, screenshots, HTTP traffic, reasoning traces, and browser actions) with a request interceptor to safely prevent final irreversible actions.
  • The authors provide an interactive leaderboard/trace viewer and release the dataset and evaluation tooling for further research and iteration.

We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.

Key findings:

  • The best model (Claude Sonnet 4.6) achieves only 33.3% success rate
  • GLM-5 (Zhipu AI) comes second at 24.2%, surprisingly strong for a text-only model
  • Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder
  • No model exceeds 50% in any category — there's a long way to go

What makes ClawBench different:

  • Tasks on real live websites, not sandboxed environments
  • 5 layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, browser actions
  • Request interceptor blocks the final HTTP request before irreversible actions (payments, bookings), enabling safe evaluation
  • Human ground-truth for every task
  • Agentic evaluator with step-level traceable diagnostics
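To make the safety mechanism concrete: the interceptor described above lets an agent walk all the way to the last step of a checkout or booking flow, then blocks the single HTTP request that would commit the action. A minimal sketch of that kind of blocking rule, in Python, is below. The method set, the path patterns, and the `is_irreversible` helper are illustrative assumptions, not ClawBench's actual implementation; in practice this predicate would be wired into a browser-level network hook (e.g. proxy or route interception) rather than called directly.

```python
import re

# Hypothetical blocking rule: treat state-changing HTTP requests aimed at
# checkout/payment/booking endpoints as "irreversible" and block them.
# These patterns are illustrative only, not ClawBench's real rule set.
IRREVERSIBLE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
IRREVERSIBLE_PATH_RE = re.compile(
    r"/(checkout|payments?|bookings?|orders?|purchase)(/|$|\?)",
    re.IGNORECASE,
)

def is_irreversible(method: str, url: str) -> bool:
    """Return True if a request looks like a final, irreversible action."""
    if method.upper() not in IRREVERSIBLE_METHODS:
        return False  # read-only requests (GET/HEAD) always pass through
    return bool(IRREVERSIBLE_PATH_RE.search(url))

if __name__ == "__main__":
    # Read-only navigation to the checkout page is allowed...
    print(is_irreversible("GET", "https://shop.example/checkout"))
    # ...but the final POST that would place the order is blocked.
    print(is_irreversible("POST", "https://shop.example/checkout/pay"))
```

The useful property of intercepting at the HTTP layer, rather than stopping the agent one click early, is that the evaluator still observes the complete request the agent *would* have sent, so it can grade the final action against ground truth without any side effects on the live site.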

Resources:

Happy to answer any questions! We're actively looking for feedback on task selection and evaluation methodology.


submitted by /u/Extreme_Play_8554