This OpenClaw paper shows why agent safety is an execution problem, not just a model problem

Reddit r/artificial / 4/8/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that agent safety risk is primarily an execution/architecture problem, because even strong models and existing defenses can still execute harmful actions after state has been compromised.
  • Experiments show that poisoning agent state along the “Capability / Identity / Knowledge” axes dramatically increases attack success, from ~24.6% to roughly 64–74%, and that even the strongest model sees its vulnerability rise to more than 3x its baseline.
  • Even the best defense evaluated leaves capability-targeted attacks highly effective (~63.8% success), suggesting defenses that only shape prompts or monitor behavior are insufficient.
  • File protection can block ~97% of attacks, but it also blocks legitimate updates at nearly the same rate, indicating a trade-off between safety hardening and operational continuity.
  • The author proposes adding a missing safety boundary to the agent pipeline: an execution-time authorization step (proposal → authorization → execution) that makes deterministic ALLOW/DENY decisions, so that when policy validation fails there is no execution path at all.

Paper: https://arxiv.org/abs/2604.04759

This OpenClaw paper is one of the clearest signals so far that agent risk is architectural, not just model quality.

A few results stood out:

- poisoning Capability / Identity / Knowledge pushes attack success from ~24.6% to ~64–74%

- even the strongest model still jumps to more than 3x its baseline vulnerability

- the strongest defense still leaves Capability-targeted attacks at ~63.8%

- file protection blocks ~97% of attacks… but also blocks legitimate updates at almost the same rate

The key point for me is not just that agents can be poisoned.

It’s that execution is still reachable after state is compromised.

That’s where current defenses feel incomplete:

- prompts shape behavior

- monitoring tells you what happened

- file protection freezes the system

But none of these define a hard boundary for whether an action can execute.

This paper basically shows:

if compromised state can still reach execution,

attacks remain viable.

Feels like the missing layer is:

proposal -> authorization -> execution

with a deterministic decision:

(intent, state, policy) -> ALLOW / DENY

and if there’s no valid authorization:

no execution path at all.
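To make the idea concrete, here is a minimal sketch of what such an execution-time authorization layer could look like. This is my own illustration, not the paper's implementation; the `Proposal` type, the `POLICY` table, and the path-prefix check are all assumptions chosen for brevity. The point it demonstrates: the policy decision is deterministic plain data (no model in the loop), and denial means the action is never dispatched at all.

```python
# Hypothetical sketch of proposal -> authorization -> execution.
# Not the paper's implementation; names and policy format are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    intent: str   # what the agent wants to do, e.g. "write_file"
    target: str   # the resource the action would touch

# Policy is plain data, evaluated deterministically: no model in the loop.
# Maps each permitted intent to its allowed path prefixes.
POLICY = {
    "read_file": ("/workspace",),
    "write_file": ("/workspace/out",),
}

def authorize(proposal: Proposal) -> bool:
    """(intent, state, policy) -> ALLOW (True) / DENY (False)."""
    prefixes = POLICY.get(proposal.intent)
    if prefixes is None:
        return False  # unknown intent: deny by default
    return any(proposal.target.startswith(p) for p in prefixes)

def execute(proposal: Proposal) -> str:
    # The only execution path runs through authorize(); on DENY the
    # action is never dispatched, regardless of what the model produced.
    if not authorize(proposal):
        raise PermissionError(f"DENY: {proposal.intent} on {proposal.target}")
    return f"ALLOW: {proposal.intent} on {proposal.target}"
```

Even if poisoned state convinces the model to propose `write_file` on `/etc/passwd`, the gate denies it: the compromise can shape the proposal but cannot reach execution.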

Curious how others read this paper.

Do you see this mainly as:

  1. a memory/state poisoning problem

  2. a capability isolation problem

  3. or evidence that agents need an execution-time authorization layer?

submitted by /u/docybo