Web-task AI hits 60% on long-task benchmarks
Microsoft Research published Webwright on May 24, 2026. It is a terminal-native web agent framework that replaces step-by-step click traces with reusable Playwright scripts. Using a single agent loop of ~1,000 lines, it scored 60.1% on the Odysseys long-task benchmark (vs. 33.5% for GPT-5.4 alone) and 86.7% on Online-Mind2Web, the highest AutoEval score among open-source harnesses at the time. The design principle: invest in the tool layer (reusable scripts) rather than in complex orchestration.
Until recently, web agents worked by narrating every click: the model re-evaluated the page after each change, so long tasks regularly stalled midway. Public demos showcased only short, controlled sequences. GPT-5.4 alone cleared just 33.5% of the Odysseys benchmark, so anything beyond simple form-filling was still experimental. The dominant view was that smarter models would eventually close the gap; few expected tool-design changes to do it faster.
Teams exploring AI-driven browser automation can now point to concrete benchmark numbers, not just demo videos. Form-filling, data-scraping, and routine web workflows become realistic candidates for integration. That said, tasks requiring real-time judgment or many conditional branches are still unreliable at this stage, so keeping a human in the loop remains the safe posture. For engineers evaluating agent frameworks, Webwright's lean-orchestration, rich-tool-layer design is worth studying regardless of whether you adopt it directly.
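To make the "reusable script instead of click trace" idea concrete, here is a minimal Python sketch. It is not Webwright's code or API: `ScriptLibrary`, `RecordingPage`, `submit_contact_form`, the CSS selectors, and the example URL are all hypothetical names chosen for illustration. The script is written against `goto`, `fill`, and `click`, which mirror methods on Playwright's `Page` object, but a recording stand-in is used so the example runs without a browser.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class PageLike(Protocol):
    """The subset of page methods the script needs (mirrors Playwright's Page)."""

    def goto(self, url: str) -> object: ...
    def fill(self, selector: str, value: str) -> None: ...
    def click(self, selector: str) -> None: ...


@dataclass
class RecordingPage:
    """Hypothetical stand-in for a browser page: records calls instead of acting."""

    calls: List[tuple] = field(default_factory=list)

    def goto(self, url: str) -> None:
        self.calls.append(("goto", url))

    def fill(self, selector: str, value: str) -> None:
        self.calls.append(("fill", selector, value))

    def click(self, selector: str) -> None:
        self.calls.append(("click", selector))


@dataclass
class ScriptLibrary:
    """Hypothetical tool layer: named scripts an agent can save once and re-run."""

    scripts: Dict[str, Callable[..., None]] = field(default_factory=dict)

    def save(self, name: str, script: Callable[..., None]) -> None:
        self.scripts[name] = script

    def run(self, name: str, page: PageLike, **params: str) -> None:
        self.scripts[name](page, **params)


def submit_contact_form(page: PageLike, *, url: str, email: str, message: str) -> None:
    """One parameterized script covers the whole flow; selectors are assumed."""
    page.goto(url)
    page.fill("#email", email)
    page.fill("#message", message)
    page.click("button[type=submit]")


library = ScriptLibrary()
library.save("submit_contact_form", submit_contact_form)

page = RecordingPage()
for email in ("a@example.com", "b@example.com"):
    # Each run is a single tool call, not one model decision per click.
    library.run(
        "submit_contact_form",
        page,
        url="https://example.com/contact",
        email=email,
        message="hello",
    )

print(len(page.calls))  # prints 8: four page actions per run, two runs
```

In this sketch the orchestration stays thin (a lookup and a call) while the saved script carries the task knowledge, which is one way to read the "invest in the tool layer" principle. With a real Playwright `Page` in place of `RecordingPage`, the same function would drive an actual browser, assuming the target page matched these selectors.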