The Real AI Agent Failure Mode Is Uncertain Completion

Dev.to / 3/28/2026


Key Points

  • The article argues that common AI agent failure discussions (hallucinations, prompt injection, tool misuse, runaway loops, bad reasoning) miss a more dangerous real-world failure mode: uncertain completion after tool side effects.
  • Uncertain completion occurs when an agent triggers an irreversible external action (e.g., payments, order creation, emails, CRM updates) but the system cannot determine whether the action actually succeeded due to timeouts, crashes, or lost responses.
  • It emphasizes that this is often not a prompting problem; the core issue is the lack of a clean execution boundary and durable state visibility that distinguishes “attempted” from “completed.”
  • The author highlights the hidden trap where systems only log that an attempt happened, not that it completed safely, making retries risky and potentially causing duplicate charges/orders/emails.
  • While idempotency keys can help, the article notes they don’t solve all cases—particularly when downstream systems lack strong idempotency semantics or stable keys—so robust result recovery mechanisms are needed instead of blind retries.


A lot of AI agent discussion focuses on the wrong failure modes.

People talk about:

hallucinations
prompt injection
tool misuse
runaway loops
bad reasoning

Those are real.

But once an agent starts calling tools that affect the outside world, a different class of failure becomes much more dangerous:

uncertain completion

That is the moment where the system cannot confidently answer:

“Did this action already happen?”

And once that question becomes ambiguous, retries get dangerous very fast.

What uncertain completion actually looks like

A common real-world path looks like this:

agent decides to call send_payment()
→ tool sends the payment request
→ timeout / crash / disconnect / lost response
→ caller does not know if it succeeded
→ retry happens
→ payment may be sent again
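The path above can be reproduced in a few lines. This is an illustrative sketch, not a real payments API: `send_payment` and `charges` are stand-ins, and the timeout is simulated to show how a naive retry duplicates the side effect.

```python
# Sketch of the failure path above: the payment lands downstream,
# but the response is lost, so a naive retry charges twice.
# `send_payment` and `charges` are illustrative, not a real API.

charges = []  # stands in for the downstream provider's ledger

def send_payment(amount):
    charges.append(amount)               # the side effect happens...
    raise TimeoutError("response lost")  # ...but the caller never learns it

def naive_agent_call(amount, retries=2):
    for _ in range(retries):
        try:
            send_payment(amount)
            return "ok"
        except TimeoutError:
            continue  # the retry guesses: "maybe it never happened"
    return "gave up"

naive_agent_call(100)
print(len(charges))  # → 2: the customer was charged twice
```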

The same thing shows up with:

order creation
booking flows
email sends
CRM mutations
support ticket creation
browser / UI automation
webhook-triggered workflows

The model may have made the correct decision.

The failure is that the system has no durable way to prove whether the side effect already happened.

This is not mainly a prompting problem

The agent is often not “being stupid.”

The system is simply missing a clean execution boundary.

That means:

the same logical action can be attempted multiple times
the caller cannot distinguish “attempted” from “completed”
retries are forced to guess

And “guessing” is exactly how you get:

duplicate payments
duplicate emails
duplicate orders
duplicate API mutations
duplicate irreversible actions

The hidden trap: “we logged the attempt”

A lot of systems record that they tried to do something.

That is not the same as recording that it completed safely.

This is where the distinction matters:

State visibility

Can your system durably see:

what was requested
what was claimed
what actually completed
what result should be returned on replay

Result recovery

If the side effect happened but the response was lost, can the system reconstruct what should happen next without re-executing the side effect?

That second part is where many systems break.

Because once the answer becomes:

“we’re not sure, so retry it”

you are already in dangerous territory.
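The safer answer is to recover the result instead of re-executing. A minimal sketch, assuming the downstream system exposes a read-only status lookup (`lookup_payment` and the `downstream` table are hypothetical):

```python
# Sketch of result recovery: when an attempt's outcome is unknown, query the
# downstream system for the operation's status instead of re-executing.
# `lookup_payment` is a hypothetical read-only status endpoint.

downstream = {"pay-7": "succeeded"}  # stands in for the provider's records

def lookup_payment(op_id):
    return downstream.get(op_id)     # read-only: safe to call repeatedly

def recover(op_id):
    status = lookup_payment(op_id)
    if status == "succeeded":
        return "already done; do not retry"
    if status is None:
        return "never landed; safe to retry"
    return f"in state {status}; wait and re-check"

print(recover("pay-7"))  # already done; do not retry
print(recover("pay-9"))  # never landed; safe to retry
```

The key property: the recovery path only reads, so it can be repeated as many times as needed without making the situation worse.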

API idempotency helps — but it is not enough

A common response is:

“Just use idempotency keys.”

That is often correct.

And if the downstream API supports strong idempotency semantics, you should absolutely use them.
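As a sketch of what that looks like: the caller generates one stable key per logical operation (not per attempt) and sends it on every retry, so the server can deduplicate. The server side here is simulated; real providers typically accept such a key as a request header.

```python
# Sketch of idempotency keys: one key per logical operation, reused on retry.
# `server_charge` simulates the provider's dedup behavior; illustrative only.

import uuid

seen = {}  # simulated server-side dedup table: key -> stored response

def server_charge(idempotency_key, amount):
    if idempotency_key in seen:
        return seen[idempotency_key]  # duplicate: replay the stored response
    response = {"charged": amount}
    seen[idempotency_key] = response
    return response

key = str(uuid.uuid4())  # generated once per logical operation, NOT per attempt
first = server_charge(key, 100)
retry = server_charge(key, 100)
print(first is retry)  # True: the retry got the stored response, no new charge
```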

But that still leaves hard cases:

the downstream API does not support idempotency
the key is not stable across retries
the first call may have succeeded but the caller cannot prove it
the side effect is happening in a browser / UI / desktop automation context
the external system gives weak or ambiguous feedback

In those cases, the problem is no longer just API-level idempotency.

It becomes:

execution-layer safety

The important split: intent vs execution

One of the cleanest ways to think about this is:

the agent should not directly own irreversible side effects

Instead, there should be a separation between:

Agent intent

“I think we should do X”

and

Execution

“X is now allowed to happen exactly once”

That is a very important boundary.

Because once the system separates:

decision
validation
execution
receipt / replay

…then retries stop being so dangerous.

A better pattern: proposal → guard → execute

A safer structure looks more like this:

agent proposes action
→ deterministic layer validates action
→ execution guard checks durable receipt
→ if already completed: return prior result
→ else: execute once and persist receipt
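That flow can be sketched in a few dozen lines. This is a minimal in-memory illustration of the structure, not a production implementation: the receipt store would need to be durable, and all names here are made up.

```python
# Minimal sketch of proposal -> guard -> execute. The agent only proposes;
# a deterministic layer validates; the guard consults durable receipts and
# executes at most once. Illustrative names; receipts must be durable in practice.

receipts = {}  # op_id -> receipt; a database table in a real system

def validate(proposal):
    # deterministic checks live here, not in the model
    return proposal["action"] == "create_order" and proposal["amount"] > 0

def guard_and_execute(proposal, do_side_effect):
    op_id = proposal["op_id"]
    if not validate(proposal):
        return {"status": "rejected"}
    if op_id in receipts:                  # already completed: replay receipt
        return receipts[op_id]
    result = do_side_effect(proposal)      # executes exactly once
    receipts[op_id] = {"status": "completed", "result": result}
    return receipts[op_id]

orders = []
proposal = {"op_id": "order-42", "action": "create_order", "amount": 3}
guard_and_execute(proposal, lambda p: orders.append(p["amount"]) or "ok")
guard_and_execute(proposal, lambda p: orders.append(p["amount"]) or "ok")
print(len(orders))  # → 1: the second call replayed the receipt
```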

This is a very different mental model from:

agent decides
→ immediately call side-effecting tool

That second pattern is where a lot of production agent systems get into trouble.

The more irreversible the action, the thicker the boundary

Not all tools should be treated equally.

A useful mental model is:

Safe tools

Examples:

search
read_file
summarize
fetch_status

These are usually fine to retry.

Side-effecting tools

Examples:

send_email
create_order
create_ticket
update_CRM

These need an execution boundary.

Irreversible / high-risk tools

Examples:

payment
delete
trade execution
account mutation

These need the strongest boundary:

deterministic identity
durable receipts
replay-safe semantics
often confirmation / policy checks
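The three tiers above can be encoded as an explicit policy table. The tool names mirror the examples in this section; the policy values themselves are an illustrative assumption, and unknown tools default to the strictest tier.

```python
# Sketch of tiered execution boundaries: the more irreversible the tool,
# the thicker the boundary. The policy table is illustrative, not prescriptive.

TOOL_RISK = {
    "search": "safe", "read_file": "safe",
    "send_email": "side_effecting", "create_order": "side_effecting",
    "send_payment": "irreversible", "delete_account": "irreversible",
}

POLICY = {
    "safe":           {"retryable": True,  "receipt": False, "confirm": False},
    "side_effecting": {"retryable": False, "receipt": True,  "confirm": False},
    "irreversible":   {"retryable": False, "receipt": True,  "confirm": True},
}

def boundary_for(tool):
    # unknown tools get the strictest boundary by default
    return POLICY[TOOL_RISK.get(tool, "irreversible")]

print(boundary_for("search")["retryable"])      # True: safe to retry blindly
print(boundary_for("send_payment")["confirm"])  # True: needs confirmation
```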

The principle is simple:

the more irreversible the action, the thicker the execution boundary should be

What systems actually need

In practice, most systems need some combination of:

stable request / operation identity
durable receipt storage
replay-safe execution semantics
result recovery
explicit separation between “propose” and “execute”
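For the first of those, stable identity, one common approach is to derive the operation id deterministically from the logical content of the action, so every retry of the same action maps to the same receipt. A sketch (which fields define "the same action" is an assumption you must make per domain):

```python
# Sketch of stable operation identity: hash the canonicalized logical content
# of the action, so retries of the same action produce the same id.
# Field choice is domain-specific; these fields are illustrative.

import hashlib
import json

def operation_id(action, params):
    canonical = json.dumps({"action": action, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = operation_id("send_email", {"to": "x@example.com", "subject": "Hi"})
b = operation_id("send_email", {"subject": "Hi", "to": "x@example.com"})
print(a == b)  # True: dict ordering does not change the identity
```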

That can be implemented many ways.

But the important thing is the architectural boundary itself.

Because once a system can confidently answer:

“yes, this already happened”

then retries become much safer.

Why this keeps showing up in agent systems

Traditional systems already had this problem.

Agents just make it more visible.

Why?

Because agents are:

retry-heavy
tool-using
asynchronous
failure-prone
often layered on top of APIs that were never designed for autonomous replay

So the moment an agent starts touching:

payments
orders
emails
browser actions
external systems

…uncertain completion becomes one of the most important production problems in the stack.

Closing thought

The scariest agent failure is often not:

“the model made the wrong choice”

It is:

“the model made the right choice twice”

And the reason that happens is usually not intelligence failure.

It is:

missing execution boundaries under uncertain completion

Related

I wrote a first piece on the execution-side pattern here:

The Execution Guard Pattern for AI Agents
https://dev.to/azender1/the-execution-guard-pattern-for-ai-agents-23m9

And I’m also building a Python reference implementation around this idea:

GitHub
https://github.com/azender1/SafeAgent