[D] Your Agent, Their Asset: Real-world safety evaluation of OpenClaw agents (CIK poisoning raises attack success to ~64–74%)

Reddit r/MachineLearning / 4/8/2026


Key Points

  • The paper reports a real-world safety evaluation of OpenClaw personal AI agents with access to Gmail, Stripe, and the local filesystem, using 12 live-system attack scenarios across multiple models.
  • It finds that baseline attack success rates are roughly 10–36.7%, but rise to roughly 64–74% after poisoning a single persistent-state dimension (CIK), with even the strongest model showing a >3× increase in vulnerability.
  • The authors’ taxonomy of persistent agent state (Capability, Identity, Knowledge) frames the problem as structural rather than model-specific, emphasizing that execution is still reachable once state is compromised.
  • Existing defenses (e.g., prompt-level alignment, monitoring/logging, and state protection mechanisms) do not fully prevent capability attacks; the best defense still leaves capability attack success at roughly 63.8%.
  • The paper argues for a shift toward stronger execution-time control via deterministic authorization—evaluating (intent, state, policy) → ALLOW/DENY—so execution occurs only when explicitly authorized.

Paper: https://arxiv.org/abs/2604.04759

This paper presents a real-world safety evaluation of OpenClaw, a personal AI agent with access to Gmail, Stripe, and the local filesystem.

The authors introduce a taxonomy of persistent agent state:

- Capability (skills / executable code)

- Identity (persona, trust configuration)

- Knowledge (memory)
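
The three dimensions above can be pictured as a typed state container. This is a hypothetical sketch of the Capability/Identity/Knowledge (CIK) taxonomy, not the paper's or OpenClaw's actual code; all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Capability:
    """Skills / executable code the agent can invoke."""
    skills: dict[str, str] = field(default_factory=dict)  # skill name -> source code

@dataclass
class Identity:
    """Persona and trust configuration."""
    persona: str = ""
    trusted_senders: set[str] = field(default_factory=set)

@dataclass
class Knowledge:
    """Long-term memory entries."""
    memory: list[str] = field(default_factory=list)

@dataclass
class AgentState:
    """Persistent agent state: poisoning any one dimension persists across sessions."""
    capability: Capability = field(default_factory=Capability)
    identity: Identity = field(default_factory=Identity)
    knowledge: Knowledge = field(default_factory=Knowledge)
```

The point of the taxonomy is that each field survives between interactions, so a single poisoned write (a malicious skill, an added "trusted" sender, a planted memory entry) keeps influencing behavior long after the injecting message is gone.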

They evaluate 12 attack scenarios on a live system across multiple models.

Key results:

- baseline attack success rate: ~10–36.7%

- after poisoning a single CIK dimension: ~64–74%

- even the strongest model shows >3× increase in vulnerability

- best defense still leaves Capability attacks at ~63.8%

- file protection blocks ~97% of attacks, but also blocks legitimate updates at a similar rate

The paper argues these vulnerabilities are structural, not model-specific.

One interpretation is that current defenses mostly operate at the behavior or context level:

- prompt-level alignment

- monitoring / logging

- state protection mechanisms

But execution remains reachable once the system state is compromised.

This suggests a different framing:

proposal -> authorization -> execution

where authorization is evaluated deterministically:

(intent, state, policy) -> ALLOW / DENY

and execution is only reachable if explicitly authorized.
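
The pipeline above can be sketched as a default-deny gate in front of every side effect. This is a minimal illustration of the framing, assuming a dict-shaped intent/state/policy; it is not the paper's implementation, and the field names (`action`, `capability_hash`, `max_amount`) are invented for the example:

```python
ALLOW, DENY = "ALLOW", "DENY"

def authorize(intent: dict, state: dict, policy: dict) -> str:
    """Deterministic: the same (intent, state, policy) always yields the same decision."""
    rule = policy.get(intent.get("action"))
    if rule is None:
        return DENY  # default-deny: actions without an explicit policy never execute
    # Refuse to act on tampered persistent state (e.g. a poisoned skill file).
    if state.get("capability_hash") != rule.get("expected_capability_hash"):
        return DENY
    # Example policy constraint: a cap on a payment action.
    if "max_amount" in rule and intent.get("amount", 0) > rule["max_amount"]:
        return DENY
    return ALLOW

def execute(intent: dict, state: dict, policy: dict) -> str:
    """Execution is only reachable through an explicit ALLOW."""
    if authorize(intent, state, policy) != ALLOW:
        raise PermissionError("denied by policy")
    return f"executed {intent['action']}"  # side effect would happen here
```

Because `authorize` is a pure function of its inputs, a compromised model proposal cannot talk its way past it: poisoning the state changes the inputs, which flips the decision to DENY rather than widening what can execute.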

Curious how others interpret this:

  1. Is this primarily a persistent state poisoning problem?

  2. A capability isolation / sandboxing problem?

  3. Or evidence that agent systems need a stronger execution-time control layer?

submitted by /u/docybo