Most AI agent tutorials show the happy path. Your agent calls an LLM, gets a response, does the thing. Ship it.
Then production happens. Rate limits. Timeouts. Malformed responses. Context window overflows. Your agent goes from "demo-ready" to "incident-generating" in about 48 hours.
I run a small operation — 5 agents max, solo founder. Every failure that wakes me up at 3am is one I should have handled in code. Here are the patterns that actually work.
Classify Your Errors First
Not all errors deserve the same treatment. The first thing I do in any agent system is classify failures into two buckets:
- Transient errors: Rate limits (429), timeouts, temporary network blips, model overload. These will probably work if you try again.
- Permanent errors: Invalid API keys, malformed prompts, context window exceeded, model doesn't exist. Retrying won't help.
```python
class ErrorClassifier:
    TRANSIENT_CODES = {429, 500, 502, 503, 504}

    @staticmethod
    def classify(error):
        if hasattr(error, 'status_code'):
            if error.status_code in ErrorClassifier.TRANSIENT_CODES:
                return "transient"
        if "timeout" in str(error).lower():
            return "transient"
        return "permanent"
```
This classification drives everything downstream. Transient errors get retries. Permanent errors get logged, reported, and gracefully degraded. When you're thinking about agent security patterns, error classification also matters — permanent auth errors need different alerting than transient network hiccups.
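As a quick sanity check, here's the same logic condensed into a standalone demo. The `RateLimitError` class is a hypothetical stand-in for a provider error carrying an HTTP status code:

```python
TRANSIENT_CODES = {429, 500, 502, 503, 504}

class RateLimitError(Exception):
    """Hypothetical provider error carrying an HTTP status code."""
    def __init__(self, status_code, message=""):
        super().__init__(message)
        self.status_code = status_code

def classify(error):
    # Same logic as ErrorClassifier.classify above, condensed for the demo
    if getattr(error, "status_code", None) in TRANSIENT_CODES:
        return "transient"
    if "timeout" in str(error).lower():
        return "transient"
    return "permanent"

print(classify(RateLimitError(429)))            # transient: worth retrying
print(classify(ValueError("invalid API key")))  # permanent: fail fast
```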
Retry Strategies That Don't Make Things Worse
The naive approach — retry immediately, retry forever — is how you turn a rate limit into a ban. Exponential backoff with jitter is the baseline:
```python
import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            if ErrorClassifier.classify(e) == "permanent":
                raise  # Don't retry permanent errors
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
```
Key details: jitter prevents thundering herd when multiple agents hit the same limit. And always cap your retries — 3 is usually enough. If it hasn't worked in 3 tries, it's not going to work in 30.
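One nice consequence of capping at 3: the total time spent backing off is bounded. A quick sketch of the schedule, mirroring the function above (there's no sleep after the final attempt, since that one re-raises instead of backing off):

```python
max_retries, base_delay = 3, 1.0

worst_case = 0.0
for attempt in range(max_retries - 1):
    delay = base_delay * (2 ** attempt)   # 1s, then 2s
    max_jitter = delay * 0.5              # up to 50% extra
    worst_case += delay + max_jitter
    print(f"after attempt {attempt}: sleep {delay:.1f}s to {delay + max_jitter:.1f}s")

print(f"worst-case total backoff: {worst_case:.1f}s")  # 4.5s
```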
Circuit Breakers for LLM Calls
Retries handle individual failures. Circuit breakers handle systemic ones. If your LLM provider is having a bad day, you don't want every request queuing up and timing out.
```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being blocked."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_time=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.last_failure_time = None
        self.state = "closed"  # closed = normal, open = blocking

    def call(self, fn):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_time:
                self.state = "half-open"
            else:
                raise CircuitOpenError("Circuit breaker is open")
        try:
            result = fn()
            if self.state == "half-open":
                self.state = "closed"
            self.failure_count = 0  # any success resets the failure count
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
```
I wrap every external LLM call in a circuit breaker. When the circuit opens, agents fall back to cached responses or simpler logic instead of piling up failures. If you're taking an observability-first approach, you'll want to track circuit state transitions — they're one of the best early warning signals.
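Tracking those transitions doesn't require a metrics stack. A minimal in-process sketch (the class and names are mine, not part of the breaker above):

```python
import time

class CircuitStateLog:
    """Record circuit state transitions for later inspection (sketch)."""
    def __init__(self):
        self.transitions = []  # (timestamp, old_state, new_state)

    def record(self, old_state, new_state):
        # Only genuine transitions are interesting signals
        if old_state != new_state:
            self.transitions.append((time.time(), old_state, new_state))

log = CircuitStateLog()
log.record("closed", "open")       # breaker tripped: early warning
log.record("open", "open")         # ignored: not a transition
log.record("open", "half-open")    # probing for recovery
log.record("half-open", "closed")  # recovered
print(len(log.transitions))        # 3
```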
Fallback Chains: Your Safety Net
When your primary model fails, having a fallback chain prevents total outage:
```python
class AllProvidersFailedError(Exception):
    pass

FALLBACK_CHAIN = [
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    {"provider": "openai", "model": "gpt-4o-mini"},
    {"provider": "local", "model": "cached_response"},
]

def call_with_fallback(prompt, chain=FALLBACK_CHAIN):
    errors = []
    for option in chain:
        try:
            return call_model(option["provider"], option["model"], prompt)
        except Exception as e:
            errors.append(f"{option['provider']}: {e}")
            continue
    raise AllProvidersFailedError(
        f"All {len(chain)} providers failed: {'; '.join(errors)}"
    )
```
The chain degrades gracefully: premium model → cheaper model → cached/static response. Your users get something even when everything is on fire.
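To see the degradation in action, here's a self-contained simulation where both remote providers are forced to fail. The stub `call_model` is illustrative, not a real client:

```python
class AllProvidersFailedError(Exception):
    pass

FALLBACK_CHAIN = [
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    {"provider": "openai", "model": "gpt-4o-mini"},
    {"provider": "local", "model": "cached_response"},
]

def call_model(provider, model, prompt):
    # Stub: simulate both remote providers being down
    if provider in ("anthropic", "openai"):
        raise ConnectionError(f"{provider} unavailable")
    return f"[cached] {prompt}"

def call_with_fallback(prompt, chain=FALLBACK_CHAIN):
    errors = []
    for option in chain:
        try:
            return call_model(option["provider"], option["model"], prompt)
        except Exception as e:
            errors.append(f"{option['provider']}: {e}")
    raise AllProvidersFailedError("; ".join(errors))

print(call_with_fallback("summarize the incident"))
# falls through to the local cached response
```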
Timeout Handling
LLM calls are slow. An agent waiting 120 seconds for a response that's never coming is wasting resources and blocking downstream work.
```python
import asyncio

async def call_with_timeout(coro, timeout_seconds=30):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(f"LLM call exceeded {timeout_seconds}s limit")
```
Set aggressive timeouts. For most agent tasks, if you haven't gotten a response in 30 seconds, something is wrong. I default to 30s for completions and 10s for embeddings.
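A self-contained demo of the pattern, with the timeout shortened so the hang is caught almost immediately (the slow coroutine is a stand-in for a hung provider call):

```python
import asyncio

async def hung_provider_call():
    await asyncio.sleep(60)  # simulate a response that never comes
    return "never reached"

async def call_with_timeout(coro, timeout_seconds=30):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(f"LLM call exceeded {timeout_seconds}s limit")

async def main():
    try:
        await call_with_timeout(hung_provider_call(), timeout_seconds=0.1)
    except TimeoutError as e:
        print(e)  # LLM call exceeded 0.1s limit

asyncio.run(main())
```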
Putting It All Together
Here's how these patterns compose in a real agent:
```python
async def agent_execute(task):
    breaker = get_circuit_breaker("llm_calls")
    try:
        result = breaker.call(
            lambda: retry_with_backoff(
                lambda: call_with_fallback(task.prompt),
                max_retries=3
            )
        )
        return AgentResult(status="success", data=result)
    except CircuitOpenError:
        return AgentResult(
            status="degraded",
            data=get_cached_response(task),
            note="Using cached response - LLM circuit open"
        )
    except AllProvidersFailedError:
        return AgentResult(
            status="failed",
            data=None,
            note="All providers unavailable"
        )
```
The key insight: every layer has a defined failure mode. Timeouts prevent hangs. Retries handle blips. Circuit breakers prevent cascading failures. Fallbacks provide degraded-but-functional responses.
What I Track
Error handling is only useful if you know it's working. For my small setup, I track:
- Error classification distribution — am I seeing more transient or permanent errors?
- Circuit breaker state changes — how often are circuits opening?
- Fallback chain depth — how far down the chain are requests going?
- Retry success rate — are retries actually recovering errors?
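At this scale, none of that needs a full metrics stack. A sketch of plain in-process counters (the names are mine, not from any particular metrics library):

```python
from collections import Counter

class AgentMetrics:
    """Minimal counters for the four signals above (sketch)."""
    def __init__(self):
        self.counts = Counter()

    def incr(self, name, n=1):
        self.counts[name] += n

    def retry_success_rate(self):
        attempts = self.counts["retry.attempt"]
        return self.counts["retry.success"] / attempts if attempts else 0.0

m = AgentMetrics()
m.incr("error.transient")
m.incr("circuit.open")        # a breaker tripped
m.incr("fallback.depth.2")    # a request fell through to the 2nd provider
m.incr("retry.attempt", 4)
m.incr("retry.success", 3)
print(m.retry_success_rate())  # 0.75
```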
Having real-time error monitoring changed how I build agents. Instead of finding out about failures from users, I catch patterns before they become outages.
The Boring Truth
None of these patterns are novel. Circuit breakers come from distributed systems. Retry with backoff is older than most of us. Fallback chains are just failover by another name.
But applying them specifically to AI agents — where failures are probabilistic, responses are non-deterministic, and costs compound with every retry — that's where the craft is. Start with error classification, layer on retries, add circuit breakers, and build fallback chains. Your 3am self will thank you.


