[D] Your Agent, Their Asset: Real-world safety evaluation of OpenClaw agents (CIK poisoning raises attack success to ~64–74%)

Reddit r/MachineLearning / 4/8/2026


Key Points

  • The paper reports a real-world safety evaluation of OpenClaw personal AI agents with access to Gmail, Stripe, and the local filesystem, using 12 live-system attack scenarios across multiple models.
  • It finds that baseline attack success rates are roughly 10–36.7%, but rise to roughly 64–74% after poisoning a single persistent-state dimension (CIK), with even the strongest model showing a >3× increase in vulnerability.
  • The authors’ taxonomy of persistent agent state (Capability, Identity, Knowledge) frames the problem as structural rather than model-specific, emphasizing that execution is still reachable once state is compromised.
  • Existing defenses (e.g., prompt-level alignment, monitoring/logging, and state protection mechanisms) do not fully prevent capability attacks; the best defense still leaves capability attack success at roughly 63.8%.
  • The paper argues for a shift toward stronger execution-time control via deterministic authorization—evaluating (intent, state, policy) → ALLOW/DENY—so execution occurs only when explicitly authorized.

Paper: https://arxiv.org/abs/2604.04759

This paper presents a real-world safety evaluation of OpenClaw, a personal AI agent with access to Gmail, Stripe, and the local filesystem.

The authors introduce a taxonomy of persistent agent state:

- Capability (skills / executable code)

- Identity (persona, trust configuration)

- Knowledge (memory)
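
The three dimensions above can be pictured as a typed state container. This is a hypothetical sketch of the Capability/Identity/Knowledge (CIK) taxonomy, not the paper's or OpenClaw's actual code; all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Capability:
    """Skills / executable code the agent can invoke."""
    skills: dict[str, str] = field(default_factory=dict)  # skill name -> source code

@dataclass
class Identity:
    """Persona and trust configuration."""
    persona: str = ""
    trusted_senders: set[str] = field(default_factory=set)

@dataclass
class Knowledge:
    """Long-term memory entries."""
    memory: list[str] = field(default_factory=list)

@dataclass
class AgentState:
    """Persistent agent state: poisoning any one dimension persists across sessions."""
    capability: Capability = field(default_factory=Capability)
    identity: Identity = field(default_factory=Identity)
    knowledge: Knowledge = field(default_factory=Knowledge)
```

The point of the taxonomy is that each field survives between interactions, so a single poisoned write (a malicious skill, an added "trusted" sender, a planted memory entry) keeps influencing behavior long after the injecting message is gone.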

They evaluate 12 attack scenarios on a live system across multiple models.

Key results:

- baseline attack success rate: ~10–36.7%

- after poisoning a single CIK dimension: ~64–74%

- even the strongest model shows >3× increase in vulnerability

- best defense still leaves Capability attacks at ~63.8%

- file protection blocks ~97% of attacks, but also blocks legitimate updates at a similar rate

The paper argues these vulnerabilities are structural, not model-specific.

One interpretation is that current defenses mostly operate at the behavior or context level:

- prompt-level alignment

- monitoring / logging

- state protection mechanisms

But execution remains reachable once the system state is compromised.

This suggests a different framing:

proposal -> authorization -> execution

where authorization is evaluated deterministically:

(intent, state, policy) -> ALLOW / DENY

and execution is only reachable if explicitly authorized.
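
The pipeline above can be sketched as a default-deny gate in front of every side effect. This is a minimal illustration of the framing, assuming a dict-shaped intent/state/policy; it is not the paper's implementation, and the field names (`action`, `capability_hash`, `max_amount`) are invented for the example:

```python
ALLOW, DENY = "ALLOW", "DENY"

def authorize(intent: dict, state: dict, policy: dict) -> str:
    """Deterministic: the same (intent, state, policy) always yields the same decision."""
    rule = policy.get(intent.get("action"))
    if rule is None:
        return DENY  # default-deny: actions without an explicit policy never execute
    # Refuse to act on tampered persistent state (e.g. a poisoned skill file).
    if state.get("capability_hash") != rule.get("expected_capability_hash"):
        return DENY
    # Example policy constraint: a cap on a payment action.
    if "max_amount" in rule and intent.get("amount", 0) > rule["max_amount"]:
        return DENY
    return ALLOW

def execute(intent: dict, state: dict, policy: dict) -> str:
    """Execution is only reachable through an explicit ALLOW."""
    if authorize(intent, state, policy) != ALLOW:
        raise PermissionError("denied by policy")
    return f"executed {intent['action']}"  # side effect would happen here
```

Because `authorize` is a pure function of its inputs, a compromised model proposal cannot talk its way past it: poisoning the state changes the inputs, which flips the decision to DENY rather than widening what can execute.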

Curious how others interpret this:

  1. Is this primarily a persistent state poisoning problem?

  2. A capability isolation / sandboxing problem?

  3. Or evidence that agent systems need a stronger execution-time control layer?

submitted by /u/docybo