OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
arXiv cs.CL / 4/28/2026
Key Points
- The paper introduces OS-SPEAR, a toolkit designed to rigorously evaluate OS agents that operate in complex GUIs, with emphasis on Safety, Performance, Efficiency, and Robustness.
- It addresses shortcomings in existing benchmarks by providing four specialized dataset subsets, including hazard-rich safety scenarios, performance sampling guided by trajectory value estimation, efficiency metrics based on latency and token consumption, and robustness testing via cross-modal disturbances.
- The toolkit includes an automated analysis tool that produces human-readable diagnostic reports to help interpret agent behavior and failure modes.
- Using OS-SPEAR, the authors evaluated 22 popular OS agents and found recurring efficiency–safety/robustness trade-offs, improved performance from specialized agents versus general-purpose models, and modality-dependent robustness weaknesses.
- The authors release the dataset and code publicly to support standardized, multidimensional ranking and development of more reliable, efficient OS agents.
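The efficiency axis above combines latency and token consumption per episode. As an illustrative sketch only (the `Episode` record and `efficiency_summary` function are hypothetical names, not part of the OS-SPEAR toolkit), such metrics might be aggregated like this:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    latency_s: float  # wall-clock time for one agent episode
    tokens: int       # total tokens consumed by the agent
    success: bool     # whether the task was completed

def efficiency_summary(episodes: list[Episode]) -> dict:
    """Aggregate per-episode latency and token usage into summary
    statistics (hypothetical metric, not OS-SPEAR's actual code)."""
    n = len(episodes)
    return {
        "avg_latency_s": sum(e.latency_s for e in episodes) / n,
        "avg_tokens": sum(e.tokens for e in episodes) / n,
        "success_rate": sum(e.success for e in episodes) / n,
    }

summary = efficiency_summary([
    Episode(latency_s=2.0, tokens=100, success=True),
    Episode(latency_s=4.0, tokens=300, success=False),
])
print(summary)  # averages latency/tokens and the task success rate
```

Reporting latency and tokens alongside success rate makes the efficiency–safety/robustness trade-offs the paper describes directly visible in a single table.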