Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

arXiv cs.AI / 4/22/2026


Key Points

  • Desktop GUI agents that rely on screenshot-and-click loops can suffer from an observation-to-action gap, creating a TOCTOU-style vulnerability window attackers can exploit.
  • The paper formalizes this issue as a Visual Atomicity Violation and demonstrates three attack primitives: notification overlay hijacking, window focus manipulation, and web DOM injection.
  • Window focus manipulation is shown to redirect agent actions with a 100% success rate while leaving no visual evidence at the observation time.
  • The proposed Pre-execution UI State Verification (PUSV) defense re-checks UI state immediately before each action using layered checks (pixel-level SSIM at the target, screenshot diffs, and X Window snapshot diffs).
  • PUSV achieves 100% action interception across 180 adversarial trials (135 notification-overlay, 45 focus-manipulation) with zero false positives and under 0.1 s of overhead. It has a blind spot against DOM-injection attacks (roughly 0% interception), indicating a need for deeper OS+DOM defenses.
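
To make the observation-to-action gap concrete, here is a toy Python model of a screenshot-and-click step. It is only an illustration of the TOCTOU pattern, not the paper's test harness: the `Desktop` and `Window` classes, the `during_gap` hook, and the window names are all hypothetical, and a real desktop routes clicks by pointer position and stacking order rather than by a simple list.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    name: str
    clicks: list = field(default_factory=list)


@dataclass
class Desktop:
    """Toy desktop: a click lands on whichever window is topmost at dispatch time."""
    stack: list  # last element is the topmost window

    def top(self):
        return self.stack[-1]

    def raise_window(self, name):
        win = next(w for w in self.stack if w.name == name)
        self.stack.remove(win)
        self.stack.append(win)

    def click(self, x, y):
        target = self.top()
        target.clicks.append((x, y))
        return target.name


def agent_step(desktop, during_gap=None):
    """One screenshot-and-click step with no re-verification before dispatch."""
    observed = desktop.top().name       # time of check: what the "screenshot" shows
    if during_gap is not None:
        during_gap(desktop)             # stands in for the agent's inference latency
    received = desktop.click(100, 200)  # time of use: coordinates dispatched blindly
    return observed, received


def demo():
    desktop = Desktop([Window("attacker_dialog"), Window("editor")])
    benign = agent_step(desktop)
    attacked = agent_step(
        desktop, during_gap=lambda d: d.raise_window("attacker_dialog")
    )
    return benign, attacked
```

In the attacked step the observation still names `editor`, because the state is changed only after the check; the click is nonetheless delivered to `attacker_dialog`. That mismatch between what was observed and what received the action is the pattern the key points above describe.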

Abstract

GUI agents that control desktop computers via screenshot-and-click loops introduce a new class of vulnerability: the observation-to-action gap (mean 6.51 s on real OSWorld workloads) creates a Time-Of-Check, Time-Of-Use (TOCTOU) window during which an unprivileged attacker can manipulate the UI state. We formalize this as a Visual Atomicity Violation and characterize three concrete attack primitives: (A) Notification Overlay Hijack, (B) Window Focus Manipulation, and (C) Web DOM Injection. Primitive B, the closest desktop analog to Android Action Rebinding, achieves 100% action-redirection success rate with zero visual evidence at the observation time. We propose Pre-execution UI State Verification (PUSV), a lightweight three-layer defense that re-verifies the UI state immediately before each action dispatch: masked pixel SSIM at the click target (L1), global screenshot diff (L2a), and X Window snapshot diff (L2b). PUSV achieves 100% Action Interception Rate across 180 adversarial trials (135 Primitive A + 45 Primitive B) with zero false positives and < 0.1 s overhead. Against Primitive C (zero-visual-footprint DOM injection), PUSV reveals a structural blind spot (~0% AIR), motivating future OS+DOM defense-in-depth architectures. No single PUSV layer alone achieves full coverage; different primitives require different detection signals, validating the layered design.
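
The layered re-verification idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `UIState` structure, the thresholds (`ssim_min`, `diff_max`, `tol`), and the patch radius are hypothetical; the L1 check uses a plain single-window SSIM over a square patch and omits whatever masking the paper applies; and the window snapshot is a hand-built set of tuples rather than data read from a real X server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIState:
    pixels: tuple       # rows of grayscale values, 0-255
    windows: frozenset  # hypothetical (window_id, x, y, w, h) tuples
    focused: str        # id of the window holding input focus


def _mean(values):
    return sum(values) / len(values)


def ssim(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM over two equal-length lists of pixel values."""
    mu_a, mu_b = _mean(a), _mean(b)
    var_a = _mean([(p - mu_a) ** 2 for p in a])
    var_b = _mean([(q - mu_b) ** 2 for q in b])
    cov = _mean([(p - mu_a) * (q - mu_b) for p, q in zip(a, b)])
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def patch(pixels, x, y, radius):
    """Flattened square patch around (x, y), clipped at the image border."""
    rows = pixels[max(0, y - radius): y + radius + 1]
    return [v for row in rows for v in row[max(0, x - radius): x + radius + 1]]


def changed_fraction(before, after, tol=10):
    """Share of pixels whose value moved by more than tol."""
    pairs = [(p, q) for r1, r2 in zip(before, after) for p, q in zip(r1, r2)]
    return sum(abs(p - q) > tol for p, q in pairs) / len(pairs)


def tripped_layers(observed, current, target,
                   ssim_min=0.95, diff_max=0.01, radius=2):
    """Names of the layers that object to dispatching the action at target."""
    x, y = target
    tripped = []
    if ssim(patch(observed.pixels, x, y, radius),
            patch(current.pixels, x, y, radius)) < ssim_min:
        tripped.append("L1")   # local appearance at the click target changed
    if changed_fraction(observed.pixels, current.pixels) > diff_max:
        tripped.append("L2a")  # screenshot changed somewhere
    if (observed.windows, observed.focused) != (current.windows, current.focused):
        tripped.append("L2b")  # window list or input focus changed
    return tripped


def demo():
    base = tuple(tuple(5 * x + 3 * y for x in range(20)) for y in range(20))
    wins = frozenset({("editor", 0, 0, 20, 20), ("dialog", 0, 0, 20, 20)})
    observed = UIState(base, wins, "editor")
    # Paint a bright 6x6 square over the click target to mimic an overlay.
    overlay = tuple(
        tuple(255 if 8 <= x < 14 and 8 <= y < 14 else v for x, v in enumerate(row))
        for y, row in enumerate(base)
    )
    return {
        "unchanged": tripped_layers(observed, observed, (10, 10)),
        "overlay": tripped_layers(observed, UIState(overlay, wins, "editor"), (10, 10)),
        "focus": tripped_layers(observed, UIState(base, wins, "dialog"), (10, 10)),
    }
```

In this toy setup the overlay case trips only the pixel-based layers and the focus case trips only the window-snapshot layer, which mirrors the abstract's point that different primitives need different signals. A change that alters neither the pixels nor the window snapshot passes all three checks, which is consistent with the blind spot the abstract reports for Primitive C.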