CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
arXiv cs.LG / 2026/3/26
Key Points
- The paper introduces CUA-Suite to address a key bottleneck in computer-use agents: the lack of large-scale, continuous, high-quality human demonstration videos, as opposed to sparse screenshot sequences.
- CUA-Suite’s core component, VideoCUA, contains about 10,000 expert tasks across 87 desktop applications with continuous 30 fps recordings, cursor kinematics, and multi-layered reasoning annotations—totaling ~55 hours and ~6M frames.
- The suite also includes UI-Vision for benchmarking grounding and planning, and GroundCUA, which provides 56K annotated screenshots and 3.6M UI element annotations for fine-grained grounding.
- Preliminary results indicate that current foundation action models have substantial difficulty on professional desktop applications, with roughly a 60% task failure rate.
- The authors argue the multimodal, temporally rich dataset enables new research directions such as screen parsing, continuous spatial control, video-based reward modeling, and visual world models, with data and models publicly released.
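The headline scale figures in the summary above are internally consistent: ~55 hours of 30 fps video works out to roughly 6M frames. The sketch below checks that arithmetic and shows one plausible shape for a per-demonstration record (continuous frames, cursor kinematics, layered reasoning annotations). The `Demonstration` schema and field names here are hypothetical illustrations, not the paper's released data format.

```python
# Minimal sketch of what a VideoCUA-style record *might* look like.
# All class and field names below are hypothetical, invented for illustration;
# consult the released dataset for the actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CursorSample:
    t: float  # seconds since recording start
    x: float  # screen coordinates
    y: float

@dataclass
class Demonstration:
    app: str                    # one of the ~87 desktop applications
    task: str                   # expert task description
    fps: int                    # continuous recording rate (30 fps per the summary)
    duration_s: float           # length of this recording
    cursor_track: List[CursorSample] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)  # multi-layered annotations

    def frame_count(self) -> int:
        # Frames in a continuous recording = rate x duration.
        return round(self.fps * self.duration_s)

# Sanity-check the dataset-level numbers quoted in the summary:
# ~55 hours of continuous 30 fps video.
total_hours = 55
fps = 30
total_frames = total_hours * 3600 * fps
print(f"{total_frames:,} frames")  # 5,940,000 -- i.e. the quoted ~6M

demo = Demonstration(app="ExampleApp", task="example task", fps=30, duration_s=20.0)
print(demo.frame_count())  # 600 frames for a 20 s clip at 30 fps
```

At ~10,000 tasks and ~6M frames, the average clip would run on the order of 600 frames (~20 s), which is what the example record reflects.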


