WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

arXiv cs.CL / 3/27/2026

Key Points

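The paper points out that although LLM-based computer-use agents can now implement web functionality from natural-language instructions, methods for automatically verifying that those implementations are correct are still lacking.
Existing evaluations lean on static visual similarity or predefined checklists, so they fall short in open-ended environments and overlook "latent logical constraints," an important aspect of software quality.
To close these gaps, the authors propose WebTestBench, a benchmark for evaluating end-to-end automated web testing, which they describe as covering comprehensive evaluation dimensions across diverse web application categories.
Testing is decomposed into two stages, checklist generation and defect detection, and a baseline framework, WebTester, is presented.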

Abstract

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
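
The two cascaded sub-tasks named in the abstract, checklist generation followed by defect detection, can be pictured with a minimal Python sketch. This is not the WebTester implementation: every name here (`ChecklistItem`, `Verdict`, `generate_checklist`, `detect_defects`, `run_cascade`, and the two stub functions) is hypothetical, and the LLM and the browser-driving agent are replaced by plain callables so the control flow can be read in isolation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    description: str  # one behaviour the web app is expected to exhibit


@dataclass
class Verdict:
    item: ChecklistItem
    passed: bool


def generate_checklist(requirement: str,
                       propose: Callable[[str], List[str]]) -> List[ChecklistItem]:
    """Stage 1 (assumed shape): turn a requirement into test items.

    `propose` stands in for an LLM call; blank proposals are dropped.
    """
    return [ChecklistItem(d.strip()) for d in propose(requirement) if d.strip()]


def detect_defects(checklist: List[ChecklistItem],
                   check: Callable[[ChecklistItem], bool]) -> List[Verdict]:
    """Stage 2 (assumed shape): check each item and record pass/fail.

    `check` stands in for an agent interacting with the live page.
    """
    return [Verdict(item, check(item)) for item in checklist]


def run_cascade(requirement: str,
                propose: Callable[[str], List[str]],
                check: Callable[[ChecklistItem], bool]) -> List[ChecklistItem]:
    """Chain the stages and return the items flagged as defects."""
    verdicts = detect_defects(generate_checklist(requirement, propose), check)
    return [v.item for v in verdicts if not v.passed]


# Hypothetical stubs in place of a real model and a real browser session.
def stub_propose(requirement: str) -> List[str]:
    return ["cart total updates after adding an item",
            "checkout rejects an empty cart"]


def stub_check(item: ChecklistItem) -> bool:
    return "empty cart" not in item.description


defects = run_cascade("A simple shopping cart page", stub_propose, stub_check)
```

Because the second stage only sees what the first stage produced, a behaviour missing from the checklist can never be reported as a defect. That dependency is one way to read the abstract's pairing of "insufficient test completeness" with "detection bottlenecks".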
