AI Navigate

Video-Based Reward Modeling for Computer-Use Agents

arXiv cs.CL / 3/12/2026


Key Points

  • The paper introduces video-based reward modeling for computer-using agents (CUAs): trajectories are judged from execution video alone, so evaluation is independent of any agent's internal reasoning or action format.
  • It releases ExeVR-53k, a dataset of 53k video–task–reward triplets, and uses adversarial instruction translation to generate negative samples with step-level annotations.
  • The approach includes spatiotemporal token pruning to efficiently learn from long, high-resolution execution videos while preserving decisive UI changes.
  • An Execution Video Reward Model (ExeVRM) is fine-tuned to predict task success from a user instruction and a video sequence, achieving 84.7% accuracy and 87.7% recall and outperforming proprietary models across Ubuntu, macOS, Windows, and Android.
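The spatiotemporal token pruning mentioned above can be sketched as a variance-based filter over patch tokens. This is a minimal illustration, not the paper's method: the patch size, thresholds, and the specific homogeneity/persistence criteria here are assumptions made for the example.

```python
import numpy as np

def prune_video_tokens(frames, patch=16, flat_thresh=1e-3,
                       change_thresh=1e-2, decisive_thresh=1e-1):
    """Illustrative spatiotemporal token pruning (thresholds are assumptions).

    frames: (T, H, W) grayscale keyframes in [0, 1], T >= 2.
    Returns a boolean mask over the N = (H//patch) * (W//patch) patch
    positions: True = keep the token for the reward model.
    """
    T, H, W = frames.shape
    gh, gw = H // patch, W // patch
    x = frames[:, : gh * patch, : gw * patch]
    # (T, N, patch*patch): flatten each keyframe into its patch tokens
    tokens = (
        x.reshape(T, gh, patch, gw, patch)
        .transpose(0, 1, 3, 2, 4)
        .reshape(T, gh * gw, patch * patch)
    )
    texture = tokens.var(axis=-1).mean(axis=0)                    # spatial detail per position
    change = np.abs(np.diff(tokens, axis=0)).mean(axis=(0, -1))   # temporal change per position
    homogeneous = texture < flat_thresh     # flat regions (e.g. blank background)
    persistent = change < change_thresh     # tokens that barely change across keyframes
    # Drop homogeneous or persistent tokens, but never drop a large UI change,
    # even in an otherwise flat region (e.g. a dialog appearing on a blank area).
    keep = ~(homogeneous | persistent) | (change >= decisive_thresh)
    return keep
```

On a 4-frame clip where only the top-left patch flips from black to white, only that position survives the pruning, which matches the stated goal of preserving decisive UI changes while discarding redundant layout.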

Abstract

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video–task–reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
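The reported accuracy and recall are standard binary-classification metrics over video–task–reward triplets: accuracy is the fraction of success/failure judgments that match the gold reward, and recall is the fraction of truly successful trajectories the model recognizes. A minimal harness (the function name and interface are my own, not the paper's code):

```python
def reward_metrics(preds, golds):
    """Accuracy and recall for binary task-success predictions.

    preds, golds: equal-length sequences of 0/1 labels, where 1 means
    the trajectory fulfilled the user instruction. Sketch only; this is
    not the paper's evaluation code.
    """
    assert len(preds) == len(golds) and len(golds) > 0
    correct = sum(p == g for p, g in zip(preds, golds))
    true_pos = sum(1 for p, g in zip(preds, golds) if p and g)
    positives = sum(golds)
    accuracy = correct / len(golds)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall
```

For example, `reward_metrics([1, 1, 0, 0], [1, 0, 0, 1])` yields accuracy 0.5 and recall 0.5: two of four judgments are correct, and one of the two truly successful trajectories is recognized.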