ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

Reddit r/MachineLearning / 4/15/2026


Key Points

  • ClawBench is introduced as a new benchmark for AI browser agents, testing performance on 153 real-world everyday tasks across 144 live websites.
  • The best reported success rate is only 33.3% (Claude Sonnet 4.6), indicating that even top models struggle to reliably complete everyday online workflows.
  • The benchmark finds notable category differences: finance and academic tasks are easier (about 50% for the best model), while travel and developer-related tasks are substantially harder.
  • ClawBench differs from synthetic tests by using real production sites and capturing 5 layers of behavioral/technical evidence (session replay, screenshots, HTTP traffic, reasoning traces, and browser actions) with a request interceptor to safely prevent final irreversible actions.
  • The authors provide an interactive leaderboard/trace viewer and release the dataset and evaluation tooling for further research and iteration.

We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.

Key findings:

  • The best model (Claude Sonnet 4.6) achieves only 33.3% success rate
  • GLM-5 (Zhipu AI) comes second at 24.2%, surprisingly strong for a text-only model
  • Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder
  • No model exceeds 50% in any category — there's a long way to go

What makes ClawBench different:

  • Tasks on real live websites, not sandboxed environments
  • 5 layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, browser actions
  • Request interceptor blocks the final HTTP request before irreversible actions (payments, bookings), enabling safe evaluation
  • Human ground-truth for every task
  • Agentic evaluator with step-level traceable diagnostics
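To make the safety mechanism concrete: the interceptor described above lets an agent walk all the way to the last step of a checkout or booking flow, then blocks the single HTTP request that would commit the action. A minimal sketch of that kind of blocking rule, in Python, is below. The method set, the path patterns, and the `is_irreversible` helper are illustrative assumptions, not ClawBench's actual implementation; in practice this predicate would be wired into a browser-level network hook (e.g. proxy or route interception) rather than called directly.

```python
import re

# Hypothetical blocking rule: treat state-changing HTTP requests aimed at
# checkout/payment/booking endpoints as "irreversible" and block them.
# These patterns are illustrative only, not ClawBench's real rule set.
IRREVERSIBLE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
IRREVERSIBLE_PATH_RE = re.compile(
    r"/(checkout|payments?|bookings?|orders?|purchase)(/|$|\?)",
    re.IGNORECASE,
)

def is_irreversible(method: str, url: str) -> bool:
    """Return True if a request looks like a final, irreversible action."""
    if method.upper() not in IRREVERSIBLE_METHODS:
        return False  # read-only requests (GET/HEAD) always pass through
    return bool(IRREVERSIBLE_PATH_RE.search(url))

if __name__ == "__main__":
    # Read-only navigation to the checkout page is allowed...
    print(is_irreversible("GET", "https://shop.example/checkout"))
    # ...but the final POST that would place the order is blocked.
    print(is_irreversible("POST", "https://shop.example/checkout/pay"))
```

The useful property of intercepting at the HTTP layer, rather than stopping the agent one click early, is that the evaluator still observes the complete request the agent *would* have sent, so it can grade the final action against ground truth without any side effects on the live site.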

Resources:

Happy to answer any questions! We're actively looking for feedback on task selection and evaluation methodology.


submitted by /u/Extreme_Play_8554