WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

arXiv cs.AI / 5/1/2026


Key Points

  • The paper introduces WindowsWorld, a new benchmark for autonomous GUI agents that evaluates performance in realistic cross-application, multi-step professional workflows rather than isolated single-app tasks.
  • WindowsWorld uses a multi-agent framework guided by 16 occupations to create tasks at four difficulty levels; tasks are refined via human review and executed in a simulated desktop environment.
  • The benchmark includes 181 tasks spanning 17 common desktop applications, averaging 5.0 sub-goals per task; 78% of the tasks inherently require coordination across multiple applications.
  • Experiments with leading large models and agents find very low success rates on multi-application tasks (<21%), frequent failures on tasks requiring conditional judgment and reasoning across three or more applications, and low execution efficiency, with tasks failing even after far exceeding the human step budget (see the evaluation sketch after this list).
  • The authors release code, benchmark data, and evaluation resources on GitHub to support further development and assessment of cross-application GUI agents.
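
To make the evaluation setup described above concrete, here is a minimal Python sketch of a per-task harness: a task record carrying ordered sub-goals and a human reference step count, and a loop that stops once all sub-goals pass or a step budget (a multiple of the human count) is exhausted. The `Task` fields and the `agent`/`env` interfaces are hypothetical stand-ins for illustration, not the authors' actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical per-task record; field names are illustrative,
    not the benchmark's actual schema."""
    task_id: str
    apps: list[str]       # desktop applications the task touches
    sub_goals: list[str]  # ordered checkpoints toward completion
    human_steps: int      # reference step count for an expert human

def run_task(agent, env, task: Task, budget_factor: float = 2.0) -> dict:
    """Act until every sub-goal passes or a step budget (a multiple of
    the human reference count) is exhausted. `agent.act`, `env.reset`,
    `env.step`, and `env.check` are assumed interfaces."""
    max_steps = int(task.human_steps * budget_factor)
    done = 0
    obs = env.reset(task.task_id)
    for step in range(1, max_steps + 1):
        obs = env.step(agent.act(obs))
        # sub-goals are checked in order; success requires all of them
        while done < len(task.sub_goals) and env.check(task.sub_goals[done]):
            done += 1
        if done == len(task.sub_goals):
            return {"success": True, "steps": step, "sub_goals_done": done}
    return {"success": False, "steps": max_steps, "sub_goals_done": done}

def success_rate(results: list[dict]) -> float:
    """Fraction of tasks fully completed within budget."""
    return sum(r["success"] for r in results) / len(results)
```

Under this framing, the paper's efficiency finding corresponds to runs that exhaust `max_steps` with `sub_goals_done` stalled at an early checkpoint.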

Abstract

While GUI agents have shown impressive capabilities on common computer-use tasks such as those in OSWorld, current benchmarks mainly focus on isolated, single-application tasks. This overlooks a critical real-world requirement: coordinating across multiple applications to accomplish complex, profession-specific workflows. To bridge this gap, we present WindowsWorld, a computer-use benchmark for cross-application workflows, designed to systematically assess GUI agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate tasks at four difficulty levels with intermediate inspection; the tasks are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results for leading large models and agents show that: 1) all computer-use agents perform poorly on multi-application tasks (<21% success rate), far below their performance on simple single-application tasks; 2) they largely fail at tasks requiring conditional judgment and reasoning across three or more applications, stalling at early sub-goals; 3) they exhibit low execution efficiency, with tasks often failing even after far exceeding the human step budget. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.
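
The abstract's generation pipeline (occupation-steered drafting, intermediate inspection, human refinement) can be sketched as a simple loop. The callables and their signatures below are assumptions made for illustration, not the released code.

```python
from typing import Callable, Optional

def generate_tasks(
    occupations: list[str],                          # the paper uses 16 occupations
    propose: Callable[[str, int], dict],             # draft a task for (occupation, difficulty)
    inspect: Callable[[dict], bool],                 # intermediate automated inspection
    human_review: Callable[[dict], Optional[dict]],  # refine a draft, or None to reject
) -> list[dict]:
    """Occupation-steered drafting -> intermediate inspection -> human
    refinement, per the abstract's description. All callables are
    hypothetical stand-ins for the multi-agent components."""
    tasks = []
    for occupation in occupations:
        for difficulty in range(1, 5):       # four difficulty levels
            draft = propose(occupation, difficulty)
            if not inspect(draft):           # drop drafts that fail inspection
                continue
            task = human_review(draft)
            if task is not None:
                tasks.append(task)           # survivors enter the benchmark
    return tasks
```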