This OpenClaw paper shows why agent safety is an execution problem, not just a model problem

Reddit r/artificial / 4/8/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that agent safety risk is primarily an execution/architecture problem, because even strong models and existing defenses can still execute harmful actions after state has been compromised.
  • Experiments show that poisoning agent state along the “Capability / Identity / Knowledge” axes dramatically increases attack success, from ~24.6% to roughly 64–74%, and that even the strongest model sees its vulnerability rise to more than 3x its baseline.
  • Even the best defense evaluated leaves capability-targeted attacks highly effective (~63.8% success), suggesting defenses that only shape prompts or monitor behavior are insufficient.
  • File protection can block ~97% of attacks, but it also blocks legitimate updates at nearly the same rate, indicating a trade-off between safety hardening and operational continuity.
  • The author proposes adding a missing safety boundary to the agent pipeline: an execution-time authorization step (proposal → authorization → execution) that makes deterministic ALLOW/DENY decisions, so that when policy validation fails there is no execution path at all.

Paper: https://arxiv.org/abs/2604.04759

This OpenClaw paper is one of the clearest signals so far that agent risk is architectural, not just model quality.

A few results stood out:

- poisoning Capability / Identity / Knowledge pushes attack success from ~24.6% to ~64–74%

- even the strongest model still jumps to more than 3x its baseline vulnerability

- the strongest defense still leaves Capability-targeted attacks at ~63.8%

- file protection blocks ~97% of attacks… but also blocks legitimate updates at almost the same rate

The key point for me is not just that agents can be poisoned.

It’s that execution is still reachable after state is compromised.

That’s where current defenses feel incomplete:

- prompts shape behavior

- monitoring tells you what happened

- file protection freezes the system

But none of these define a hard boundary for whether an action can execute.

This paper basically shows:

if compromised state can still reach execution,

attacks remain viable.

Feels like the missing layer is:

proposal -> authorization -> execution

with a deterministic decision:

(intent, state, policy) -> ALLOW / DENY

and if there’s no valid authorization:

no execution path at all.
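To make the idea concrete, here is a minimal sketch of what such an execution-time authorization layer could look like. This is my own illustration, not the paper's implementation; the `Proposal` type, the `POLICY` table, and the path-prefix check are all assumptions chosen for brevity. The point it demonstrates: the policy decision is deterministic plain data (no model in the loop), and denial means the action is never dispatched at all.

```python
# Hypothetical sketch of proposal -> authorization -> execution.
# Not the paper's implementation; names and policy format are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    intent: str   # what the agent wants to do, e.g. "write_file"
    target: str   # the resource the action would touch

# Policy is plain data, evaluated deterministically: no model in the loop.
# Maps each permitted intent to its allowed path prefixes.
POLICY = {
    "read_file": ("/workspace",),
    "write_file": ("/workspace/out",),
}

def authorize(proposal: Proposal) -> bool:
    """(intent, state, policy) -> ALLOW (True) / DENY (False)."""
    prefixes = POLICY.get(proposal.intent)
    if prefixes is None:
        return False  # unknown intent: deny by default
    return any(proposal.target.startswith(p) for p in prefixes)

def execute(proposal: Proposal) -> str:
    # The only execution path runs through authorize(); on DENY the
    # action is never dispatched, regardless of what the model produced.
    if not authorize(proposal):
        raise PermissionError(f"DENY: {proposal.intent} on {proposal.target}")
    return f"ALLOW: {proposal.intent} on {proposal.target}"
```

Even if poisoned state convinces the model to propose `write_file` on `/etc/passwd`, the gate denies it: the compromise can shape the proposal but cannot reach execution.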

Curious how others read this paper.

Do you see this mainly as:

  1. a memory/state poisoning problem

  2. a capability isolation problem

  3. or evidence that agents need an execution-time authorization layer?

submitted by /u/docybo