The Art of Building Verifiers for Computer Use Agents

arXiv cs.AI / 4/10/2026


Key Points

  • The paper argues that reliable verification of computer use agent (CUA) trajectories is essential because otherwise evaluation and training signals become untrustworthy.
  • It presents the “Universal Verifier,” built on four principles: meaningful non-overlapping rubrics, separate process vs. outcome rewards, controllable vs. uncontrollable failure scoring, and divide-and-conquer screenshot context management for long horizons.
  • The authors validate the approach on the newly released CUAVerifierBench dataset with both process and outcome human labels, finding human-level agreement rates.
  • The Universal Verifier reduces false positive rates to near zero, versus ≥45% for baselines like WebVoyager and ≥22% for WebJudge; the authors attribute the gains to the cumulative effect of these design choices.
  • The work also finds that an auto-research agent reaches ~70% of expert quality in about 5% of the time, but fails to discover all the strategies needed to replicate the verifier. The system and benchmark are open-sourced by Microsoft.
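
The first three principles above can be sketched as a simple scoring routine. This is an illustrative sketch only, not the paper's implementation: the `RubricItem` dataclass, field names, and the equal-weight averaging are assumptions made for clarity.

```python
# Hypothetical sketch of rubric-based scoring with separate process and
# outcome rewards and a cascading-error-free treatment of failures.
# All names and the weighting scheme are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RubricItem:
    name: str           # one meaningful, non-overlapping criterion
    passed: bool
    controllable: bool  # was a failure within the agent's control?

def score_trajectory(process: list[RubricItem],
                     outcome: list[RubricItem]) -> dict:
    """Combine separate process and outcome rewards.

    Uncontrollable failures (e.g. a site outage) are excluded from the
    denominator, so cascading errors do not penalize otherwise
    correct behavior.
    """
    def reward(items: list[RubricItem]) -> float:
        # Keep items that passed, or that failed for controllable reasons.
        scored = [i for i in items if i.controllable or i.passed]
        if not scored:
            return 0.0
        return sum(i.passed for i in scored) / len(scored)

    return {"process_reward": reward(process),
            "outcome_reward": reward(outcome)}
```

Keeping process and outcome rewards separate preserves the complementary signals the paper describes: an agent can score well on process while being blocked from the outcome, or vice versa.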

Abstract

Verifying the success of computer use agent (CUA) trajectories is a critical challenge: without reliable verification, neither evaluation nor training signal can be trusted. In this paper, we present lessons learned from building a best-in-class verifier for web tasks we call the Universal Verifier. We design the Universal Verifier around four key principles: 1) constructing rubrics with meaningful, non-overlapping criteria to reduce noise; 2) separating process and outcome rewards that yield complementary signals, capturing cases where an agent follows the right steps but gets blocked or succeeds through an unexpected path; 3) distinguishing between controllable and uncontrollable failures scored via a cascading-error-free strategy for finer-grained failure understanding; and 4) a divide-and-conquer context management scheme that attends to all screenshots in a trajectory, improving reliability on longer task horizons. We validate these findings on CUAVerifierBench, a new set of CUA trajectories with both process and outcome human labels, showing that our Universal Verifier agrees with humans as often as humans agree with each other. We report a reduction in false positive rates to near zero compared to baselines like WebVoyager (≥45%) and WebJudge (≥22%). We emphasize that these gains stem from the cumulative effect of the design choices above. We also find that an auto-research agent achieves 70% of expert quality in 5% of the time, but fails to discover all strategies required to replicate the Universal Verifier. We open-source our Universal Verifier system along with CUAVerifierBench; available at https://github.com/microsoft/fara.
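The fourth principle, divide-and-conquer context management, can be sketched as chunking a long screenshot trajectory so every frame is attended to. This is a minimal sketch under stated assumptions: the chunk size, `judge_chunk` callback, and all-chunks-must-pass aggregation are illustrative, not the released system's actual design.

```python
# Illustrative divide-and-conquer context management: split a long
# trajectory of screenshots into chunks that each fit a judge model's
# context window, judge each chunk, then aggregate. The function names
# and aggregation rule are assumptions, not the open-sourced API.
from typing import Callable

def verify_long_trajectory(
    screenshots: list[str],
    judge_chunk: Callable[[list[str]], bool],
    chunk_size: int = 8,
) -> bool:
    """Return True only if every chunk of the trajectory passes.

    Chunking lets the verifier consider all screenshots instead of
    truncating long horizons to the most recent frames.
    """
    chunks = [screenshots[i:i + chunk_size]
              for i in range(0, len(screenshots), chunk_size)]
    return all(judge_chunk(c) for c in chunks)
```

The point of the scheme is coverage: a naive verifier that only sees the last few screenshots can miss a mid-trajectory failure, whereas chunked judging inspects every frame.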