CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
arXiv cs.LG / 2026/3/26
Key Points
- The paper introduces CUA-Suite to address a key bottleneck in computer-use agents: the lack of large-scale, continuous, high-quality human demonstration videos, as opposed to sparse screenshot sequences.
- CUA-Suite’s core component, VideoCUA, contains about 10,000 expert tasks across 87 desktop applications with continuous 30 fps recordings, cursor kinematics, and multi-layered reasoning annotations—totaling ~55 hours and ~6M frames.
- The suite also includes UI-Vision for benchmarking grounding and planning, and GroundCUA, which provides 56K annotated screenshots and 3.6M UI element annotations for fine-grained grounding.
- Preliminary results indicate that current foundation action models have substantial difficulty on professional desktop applications, with roughly a 60% task failure rate.
- The authors argue the multimodal, temporally rich dataset enables new research directions such as screen parsing, continuous spatial control, video-based reward modeling, and visual world models, with data and models publicly released.
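The headline scale figures in the summary above are internally consistent: ~55 hours of 30 fps video works out to roughly 6M frames. The sketch below checks that arithmetic and shows one plausible shape for a per-demonstration record (continuous frames, cursor kinematics, layered reasoning annotations). The `Demonstration` schema and field names here are hypothetical illustrations, not the paper's released data format.

```python
# Minimal sketch of what a VideoCUA-style record *might* look like.
# All class and field names below are hypothetical, invented for illustration;
# consult the released dataset for the actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CursorSample:
    t: float  # seconds since recording start
    x: float  # screen coordinates
    y: float

@dataclass
class Demonstration:
    app: str                    # one of the ~87 desktop applications
    task: str                   # expert task description
    fps: int                    # continuous recording rate (30 fps per the summary)
    duration_s: float           # length of this recording
    cursor_track: List[CursorSample] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)  # multi-layered annotations

    def frame_count(self) -> int:
        # Frames in a continuous recording = rate x duration.
        return round(self.fps * self.duration_s)

# Sanity-check the dataset-level numbers quoted in the summary:
# ~55 hours of continuous 30 fps video.
total_hours = 55
fps = 30
total_frames = total_hours * 3600 * fps
print(f"{total_frames:,} frames")  # 5,940,000 -- i.e. the quoted ~6M

demo = Demonstration(app="ExampleApp", task="example task", fps=30, duration_s=20.0)
print(demo.frame_count())  # 600 frames for a 20 s clip at 30 fps
```

At ~10,000 tasks and ~6M frames, the average clip would run on the order of 600 frames (~20 s), which is what the example record reflects.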


