OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents

arXiv cs.CL · April 28, 2026


Key Points

  • The paper introduces OS-SPEAR, a toolkit designed to rigorously evaluate OS agents that operate in complex GUIs, with emphasis on Safety, Performance, Efficiency, and Robustness.
  • It addresses shortcomings in existing benchmarks by providing four specialized dataset subsets, including hazard-rich safety scenarios, performance sampling guided by trajectory value estimation, efficiency metrics based on latency and token consumption, and robustness testing via cross-modal disturbances.
  • The toolkit includes an automated analysis tool that produces human-readable diagnostic reports to help interpret agent behavior and failure modes.
  • Using OS-SPEAR, the authors evaluated 22 popular OS agents and found recurring trade-offs between efficiency and safety or robustness, that specialized agents outperform general-purpose models, and that robustness weaknesses vary by modality.
  • The authors release the dataset and code publicly to support standardized, multidimensional ranking and development of more reliable, efficient OS agents.
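The efficiency dimension above measures agents through two lenses: wall-clock latency and token consumption. As a minimal illustration of how such a per-step probe might look, here is a sketch in Python; the `agent_step` callable and its `prompt_tokens`/`completion_tokens` fields are hypothetical stand-ins, not OS-SPEAR's actual API.

```python
import time

def measure_step_efficiency(agent_step, observation):
    """Record wall-clock latency and token usage for one agent step.

    `agent_step` is a hypothetical callable returning a dict that includes
    token-usage counters; OS-SPEAR's real metrics may be defined differently.
    """
    start = time.perf_counter()
    result = agent_step(observation)
    latency_s = time.perf_counter() - start
    tokens = result.get("prompt_tokens", 0) + result.get("completion_tokens", 0)
    return {"latency_s": latency_s, "total_tokens": tokens}

# Toy agent stub for demonstration only.
def dummy_agent(obs):
    return {"action": "click", "prompt_tokens": 812, "completion_tokens": 35}

stats = measure_step_efficiency(
    dummy_agent, {"screenshot": None, "instruction": "open settings"}
)
```

Aggregating such per-step records over a full trajectory would yield the kind of dual latency/token profile the E-subset reports.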

Abstract

The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and code are available at https://github.com/Wuzheng02/OS-SPEAR.
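The R-subset applies cross-modal disturbances to both the visual and textual channels. The abstract does not specify the perturbation functions, so the following is only an illustrative sketch of what such disturbances could look like: character-swap noise for an instruction string and additive pixel noise for a screenshot. Both function names and parameters are hypothetical, not taken from OS-SPEAR.

```python
import random

def perturb_text(instruction: str, swap_prob: float = 0.1, seed: int = 0) -> str:
    """Illustrative textual disturbance: randomly swap adjacent letters."""
    rng = random.Random(seed)
    chars = list(instruction)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def perturb_image(pixels, noise: int = 8, seed: int = 0):
    """Illustrative visual disturbance: bounded additive noise on a
    nested list of 8-bit grayscale pixel values."""
    rng = random.Random(seed)
    return [
        [max(0, min(255, p + rng.randint(-noise, noise))) for p in row]
        for row in pixels
    ]
```

Running an agent on both the clean and the perturbed inputs, and comparing task success, is one simple way to quantify modality-specific robustness gaps like those the paper reports.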