Owner-Harm: A Missing Threat Model for AI Agent Safety

arXiv cs.AI · April 22, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that current AI agent safety benchmarks overlook a commercially significant threat class: “owner-harm,” where agents damage the people or organizations that deploy them.
  • It points to real incidents (e.g., Slack credential exfiltration, Copilot calendar injection leaks, and a Meta agent posting unauthorized information) as evidence of this gap, and proposes a formal Owner-Harm threat model with eight behavior categories.
  • In experiments, an existing compositional safety system detects generic criminal-harm tasks perfectly (100% TPR, 0% FPR on AgentHarm) but catches only 14.8% of prompt-injection-mediated owner-harm tasks (AgentDojo injections).
  • The authors show that the deficit is not inherent to owner-harm itself (a generic-LLM baseline performs nearly equally on both task types: 62.7% vs. 59.3%), attributing the failure to environment-bound symbolic rules that do not generalize across tool vocabularies.
  • They further demonstrate that combining the gate with a deterministic post-audit verifier improves detection (raising overall TPR to 85.3% and Hijacking detection from 43.3% to 93.3%), and introduce the Symbolic-Semantic Defense Generalization (SSDG) framework, which relates information coverage to detection rate.
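The gate-plus-verifier layering described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the tool names, the string-matching alignment check, and the `SENSITIVE_TOOLS` set are all invented here to show the shape of a pre-execution gate paired with a deterministic post-execution audit.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    args: dict

@dataclass
class Trace:
    user_goal: str
    calls: list = field(default_factory=list)

# Illustrative assumption: a fixed set of tools whose misuse could harm the owner.
SENSITIVE_TOOLS = {"send_email", "post_message", "transfer_funds"}

def gate(goal: str, proposed: ToolCall) -> bool:
    """Pre-execution layer: refuse sensitive tools the stated goal never asks for.

    A toy stand-in for the paper's gate; real alignment checking would be
    semantic, not substring matching.
    """
    if proposed.tool in SENSITIVE_TOOLS and proposed.tool not in goal:
        return False  # block the call before it runs
    return True

def post_audit(trace: Trace) -> list:
    """Post-execution layer: deterministically flag executed sensitive calls
    that the user goal did not request (e.g., injected hijacking actions)."""
    return [c for c in trace.calls
            if c.tool in SENSITIVE_TOOLS and c.tool not in trace.user_goal]
```

The point of the second layer is complementarity: even when the gate misses an injected action at proposal time, the audit re-checks the full executed trace against the original goal, which is where the paper reports the large jump in Hijacking detection.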

Abstract

Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar-injection leaks (Jan 2024), and a Meta agent unauthorized forum post exposing operational data (Mar 2026). We propose Owner-Harm, a formal threat model with eight categories of agent behavior damaging the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% (4/27; 95% CI: 5.9%-32.5%) on AgentDojo injection tasks (prompt-injection-mediated owner harm). A controlled generic-LLM baseline shows that the gap is not inherent to owner-harm (62.7% vs. 59.3%, delta 3.4 pp) but arises from environment-bound symbolic rules that fail to generalize across tool vocabularies. On a post-hoc 300-scenario owner-harm benchmark, the gate alone achieves 75.3% TPR / 3.3% FPR; adding a deterministic post-audit verifier raises overall TPR to 85.3% (+10.0 pp) and Hijacking detection from 43.3% to 93.3%, demonstrating strong layer complementarity. We introduce the Symbolic-Semantic Defense Generalization (SSDG) framework relating information coverage to detection rate. Two SSDG experiments partially validate it: context deprivation amplifies the detection gap 3.4x (R = 3.60 vs. R = 1.06); context injection reveals that structured goal-action alignment, not text concatenation, is required for effective owner-harm detection.
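As a side note on the reported interval: the abstract does not say which method produced the 95% CI for the 4/27 result, but the quoted bounds (5.9%-32.5%) are consistent with a Wilson score interval, which can be checked in a few lines:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 gives the conventional 95% level.
    """
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_ci(4, 27)  # the 14.8% (4/27) AgentDojo result
print(f"{lo:.1%} - {hi:.1%}")  # → 5.9% - 32.5%
```

The Wilson interval is a standard choice for small-sample proportions because, unlike the normal approximation, it stays within [0, 1] and behaves sensibly near extreme rates.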