CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
arXiv cs.LG / 3/26/2026
Key Points
- The paper introduces CUA-Suite to address a key bottleneck in computer-use agents: the lack of large-scale, continuous, high-quality human demonstration video rather than sparse screenshots.
- CUA-Suite’s core component, VideoCUA, contains about 10,000 expert tasks across 87 desktop applications, recorded continuously at 30 fps with cursor kinematics and multi-layered reasoning annotations, for a total of about 55 hours and 6M frames.
- The suite also adds UI-Vision for benchmarking grounding and planning, and GroundCUA, which provides 56K annotated screenshots plus 3.6M UI element annotations for fine-grained grounding.
- Preliminary results indicate that current foundation action models have substantial difficulty on professional desktop applications, with roughly a 60% task failure rate.
- The authors argue the multimodal, temporally rich dataset enables new research directions such as screen parsing, continuous spatial control, video-based reward modeling, and visual world models, with data and models publicly released.
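
As a quick sanity check on the scale figures above, the short sketch below recomputes the frame count from the stated duration and frame rate. It uses only the rounded numbers quoted in the key points; the per-task and per-screenshot averages are derived here for illustration and are not figures reported by the paper.

```python
# Back-of-the-envelope check of the rounded figures quoted in the key points.
# All inputs are approximations taken from the summary, not exact dataset counts.
HOURS = 55                  # ~55 hours of recordings (VideoCUA)
FPS = 30                    # continuous 30 fps capture
TASKS = 10_000              # ~10,000 expert tasks
SCREENSHOTS = 56_000        # 56K annotated screenshots (GroundCUA)
UI_ANNOTATIONS = 3_600_000  # 3.6M UI element annotations (GroundCUA)


def dataset_stats():
    """Derive frame count and rough averages from the quoted figures."""
    total_seconds = HOURS * 3600
    return {
        "frames": total_seconds * FPS,
        # Derived averages (assumption: uniform split; the paper may report otherwise).
        "seconds_per_task": total_seconds / TASKS,
        "annotations_per_screenshot": UI_ANNOTATIONS / SCREENSHOTS,
    }


stats = dataset_stats()
for name, value in stats.items():
    print(f"{name}: {value:,.1f}")
```

The computed frame count comes to 5.94 million, which is consistent with the "~6M frames" quoted above. The derived averages work out to roughly 20 seconds of video per task and about 64 annotated UI elements per screenshot.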