# The Real AI Agent Failure Mode Is Uncertain Completion
A lot of AI agent discussion focuses on the wrong failure modes.
People talk about:

- hallucinations
- prompt injection
- tool misuse
- runaway loops
- bad reasoning
Those are real.
But once an agent starts calling tools that affect the outside world, a different class of failure becomes much more dangerous: **uncertain completion**.

That is the moment where the system cannot confidently answer:

> “Did this action already happen?”

And once that question becomes ambiguous, retries get dangerous very fast.
## What uncertain completion actually looks like
A common real-world path looks like this:

1. agent decides to call `send_payment()`
2. the tool sends the payment request
3. timeout / crash / disconnect / lost response
4. the caller does not know if it succeeded
5. a retry happens
6. the payment may be sent again
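To make the failure concrete, here is a minimal self-contained simulation of that path. The `FlakyPaymentAPI` class is hypothetical (not a real SDK): the charge lands, but the response is lost, so a naive retry loop charges twice.

```python
class FlakyPaymentAPI:
    """Simulates a payment service whose response can be lost in transit."""

    def __init__(self):
        self.charges = []           # side effects that actually happened
        self._drop_next_reply = True

    def send_payment(self, amount):
        self.charges.append(amount)  # the side effect happens first...
        if self._drop_next_reply:
            self._drop_next_reply = False
            raise TimeoutError("response lost")  # ...but the caller never learns
        return {"status": "ok"}


api = FlakyPaymentAPI()

# Naive retry loop: on timeout, just try again.
for attempt in range(2):
    try:
        result = api.send_payment(100)
        break
    except TimeoutError:
        continue  # "we're not sure, so retry it"

print(len(api.charges))  # → 2: the customer was charged twice
```

The model's decision was correct both times; the duplication comes entirely from the lost response.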
The same thing shows up with:

- order creation
- booking flows
- email sends
- CRM mutations
- support ticket creation
- browser / UI automation
- webhook-triggered workflows
The model may have made the correct decision.
The failure is that the system has no durable way to prove whether the side effect already happened.
## This is not mainly a prompting problem
The agent is often not “being stupid.”
The system is simply missing a clean execution boundary.
That means:

- the same logical action can be attempted multiple times
- the caller cannot distinguish “attempted” from “completed”
- retries are forced to guess
And “guessing” is exactly how you get:

- duplicate payments
- duplicate emails
- duplicate orders
- duplicate API mutations
- duplicate irreversible actions
## The hidden trap: “we logged the attempt”
A lot of systems record that they tried to do something.
That is not the same as recording that it completed safely.
This is where the distinction matters:
### State visibility

Can your system durably see:

- what was requested
- what was claimed
- what actually completed
- what result should be returned on replay
### Result recovery

If the side effect happened but the response was lost, can the system reconstruct what should happen next without re-executing the side effect?
That second part is where many systems break.
Because once the answer becomes:

> “we’re not sure, so retry it”

you are already in dangerous territory.
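Result recovery can be sketched as a durable receipt store: the result of every completed operation is persisted under a stable operation id before the caller ever sees it, so a replay returns the prior result instead of re-executing. The `ReceiptStore` class and file layout here are illustrative assumptions, not a specific library.

```python
import json
import os
import tempfile


class ReceiptStore:
    """Hypothetical durable receipt store backed by a JSON file."""

    def __init__(self, path):
        self.path = path

    def get(self, op_id):
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f).get(op_id)

    def put(self, op_id, result):
        data = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                data = json.load(f)
        data[op_id] = result
        with open(self.path, "w") as f:
            json.dump(data, f)


store = ReceiptStore(os.path.join(tempfile.mkdtemp(), "receipts.json"))

# The side effect completed, so persist its result under a stable id.
store.put("order-42", {"status": "completed", "order_id": "A1"})

# On replay after a lost response, recover the result instead of re-executing:
print(store.get("order-42"))  # → {'status': 'completed', 'order_id': 'A1'}
```

The key property is the ordering: the receipt is written at completion time, so "did this already happen?" has a durable answer even if the original response never arrived.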
## API idempotency helps, but it is not enough
A common response is:

> “Just use idempotency keys.”

That is often correct. And if the downstream API supports strong idempotency semantics, you should absolutely use them.
But that still leaves hard cases:

- the downstream API does not support idempotency
- the key is not stable across retries
- the first call may have succeeded, but the caller cannot prove it
- the side effect is happening in a browser / UI / desktop automation context
- the external system gives weak or ambiguous feedback
In those cases, the problem is no longer just API-level idempotency. It becomes **execution-layer safety**.
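The "key is not stable across retries" failure is worth pinning down. A sketch of the fix, assuming the downstream API accepts an idempotency key: derive the key deterministically from the *logical* action, never from the attempt (no UUIDs, no timestamps), so every retry of the same action carries the same key. The payload shape and helper name below are illustrative, not a specific vendor's API.

```python
import hashlib
import json


def idempotency_key(action: dict) -> str:
    """Stable key derived from the logical action, not the attempt."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


action = {"op": "send_payment", "invoice": "INV-1001", "amount_cents": 5000}

key_first_try = idempotency_key(action)
key_retry = idempotency_key(action)  # a retry recomputes the same key

print(key_first_try == key_retry)  # → True: the downstream API can deduplicate
```

If the key were generated per attempt (a fresh UUID on each call), the downstream deduplication would see two distinct requests and the duplicate would go through.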
## The important split: intent vs execution
One of the cleanest ways to think about this is:

> the agent should not directly own irreversible side effects

Instead, there should be a separation between:

**Agent intent**

> “I think we should do X”

and

**Execution**

> “X is now allowed to happen exactly once”

That is a very important boundary.
Because once the system separates:

- decision
- validation
- execution
- receipt / replay

…then retries stop being so dangerous.
## A better pattern: proposal → guard → execute
A safer structure looks more like this:

1. agent proposes an action
2. a deterministic layer validates the action
3. an execution guard checks for a durable receipt
4. if already completed: return the prior result
5. else: execute once and persist the receipt

This is a very different mental model from:

1. agent decides
2. immediately call the side-effecting tool

That second pattern is where a lot of production agent systems get into trouble.
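The five steps above can be sketched end to end. This is a minimal in-memory version under stated assumptions: the function names (`validate`, `execute_guarded`), the proposal shape, and the dict-backed receipt map are illustrative; production code would use durable storage and real policy checks.

```python
import hashlib
import json

receipts = {}  # in production: durable storage, not an in-memory dict


def operation_id(proposal: dict) -> str:
    """Stable identity for the logical action, not the attempt."""
    canonical = json.dumps(proposal, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def validate(proposal: dict) -> None:
    """Deterministic policy layer: reject malformed or disallowed actions."""
    if proposal.get("op") != "send_payment":
        raise ValueError("operation not allowed")
    if proposal.get("amount_cents", 0) <= 0:
        raise ValueError("invalid amount")


def execute_guarded(proposal: dict, side_effect) -> dict:
    validate(proposal)                 # step 2: deterministic validation
    op_id = operation_id(proposal)
    if op_id in receipts:              # step 3/4: replay the prior result
        return receipts[op_id]
    result = side_effect(proposal)     # step 5: execute exactly once...
    receipts[op_id] = result           # ...and persist the receipt
    return result


calls = []


def send_payment(p):
    calls.append(p)
    return {"status": "ok", "charged": p["amount_cents"]}


proposal = {"op": "send_payment", "invoice": "INV-7", "amount_cents": 2500}
first = execute_guarded(proposal, send_payment)
retry = execute_guarded(proposal, send_payment)  # a blind retry is now safe

print(first == retry, len(calls))  # → True 1
```

Note that the agent never calls `send_payment` directly; it only produces the proposal dict, and the guard owns whether the side effect is allowed to happen.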
## The more irreversible the action, the thicker the boundary
Not all tools should be treated equally.
A useful mental model is:
### Safe tools

Examples:

- search
- read_file
- summarize
- fetch_status

These are usually fine to retry.
### Side-effecting tools

Examples:

- send_email
- create_order
- create_ticket
- update_CRM

These need an execution boundary.
### Irreversible / high-risk tools

Examples:

- payment
- delete
- trade execution
- account mutation

These need the strongest boundary:

- deterministic identity
- durable receipts
- replay-safe semantics
- often confirmation / policy checks
The principle is simple:

> the more irreversible the action, the thicker the execution boundary should be
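One way to make that principle operational is an explicit tier table that the retry logic consults before touching any tool. The tool names, tier labels, and policy strings below are examples of the three tiers just described, not a fixed taxonomy.

```python
from enum import Enum


class Tier(Enum):
    SAFE = "safe"                  # fine to retry blindly
    SIDE_EFFECTING = "guarded"     # needs an execution boundary
    IRREVERSIBLE = "strict"        # receipts + replay safety + confirmation


# Illustrative mapping; every tool an agent can call gets a tier.
TOOL_TIERS = {
    "search": Tier.SAFE,
    "read_file": Tier.SAFE,
    "send_email": Tier.SIDE_EFFECTING,
    "create_order": Tier.SIDE_EFFECTING,
    "send_payment": Tier.IRREVERSIBLE,
    "delete_account": Tier.IRREVERSIBLE,
}


def retry_policy(tool: str) -> str:
    # Unknown tools default to the strictest tier: assume the worst.
    tier = TOOL_TIERS.get(tool, Tier.IRREVERSIBLE)
    if tier is Tier.SAFE:
        return "retry freely"
    if tier is Tier.SIDE_EFFECTING:
        return "retry only through the execution guard"
    return "retry only with a durable receipt and explicit confirmation"


print(retry_policy("search"))        # → retry freely
print(retry_policy("send_payment"))  # → retry only with a durable receipt and explicit confirmation
```

The defaulting rule matters: a tool the table has never seen should inherit the thickest boundary, not the thinnest.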
## What systems actually need
In practice, most systems need some combination of:

- stable request / operation identity
- durable receipt storage
- replay-safe execution semantics
- result recovery
- explicit separation between “propose” and “execute”
That can be implemented many ways. But the important thing is the architectural boundary itself.

Because once a system can confidently answer:

> “yes, this already happened”

then retries become much safer.
## Why this keeps showing up in agent systems
Traditional systems already had this problem.
Agents just make it more visible.
Why?
Because agents are:

- retry-heavy
- tool-using
- asynchronous
- failure-prone
- often layered on top of APIs that were never designed for autonomous replay
So the moment an agent starts touching:

- payments
- orders
- emails
- browser actions
- external systems

…uncertain completion becomes one of the most important production problems in the stack.
## Closing thought
The scariest agent failure is often not:

> “the model made the wrong choice”

It is:

> “the model made the right choice twice”

And the reason that happens is usually not an intelligence failure. It is **missing execution boundaries under uncertain completion**.
## Related

I wrote a first piece on the execution-side pattern here: [The Execution Guard Pattern for AI Agents](https://dev.to/azender1/the-execution-guard-pattern-for-ai-agents-23m9)
And I’m also building a Python reference implementation around this idea: