Applying SRE principles to AI agents in production — ownership, observability, SLOs, runbooks, and the kill switch pattern.
I've spent a year closely studying how AI agents fail in the wild — across incidents, postmortems, and real operational patterns — and what I keep noticing is a gap nobody talks about. Teams celebrate capability. Nobody builds operational readiness.
Here's what that gap costs, and how to close it.
The Gap: AI Agents Are Treated Like Features, Not Services
In traditional SRE, every production service has:
✅ A named owner who carries the pager
✅ A defined SLO
✅ An on-call rotation
✅ A runbook
✅ A postmortem process
Most AI agents have a demo video and a Slack channel.
This is a category error. An agent is not a feature. It is an autonomous decision-making service, and when it fails, it doesn't fail quietly like a broken button. It fails at the speed of your automation — and often with external side effects: emails sent, APIs called, records written.
The Failure Nobody Talks About
The failure everyone prepares for is the hard failure: an exception thrown, a timeout, a 500 error. These are easy to catch. CloudWatch alarm, SNS notification, done.
The failure nobody prepares for is the silent degradation.
The agent completes tasks. Dashboards stay green. But for the last 6 hours, its reasoning has been subtly wrong — selecting the wrong tools, misinterpreting scope, producing outputs that look correct and aren't.
This is the worst case. Not failure. Plausible, undetected, incorrect action at scale.
Traditional observability doesn't catch this. You need a new layer.
Introducing HER: Human Escalation Rate
The most useful signal I've seen for agent health is one most teams don't track:
HER = (decisions requiring human override / total decisions) × 100
HER is to AI agents what error rate is to APIs. It tells you whether the agent's judgment is holding up.
Here's a simple implementation:
```python
def publish_her_metric(agent_id: str, human_overrides: int, total_decisions: int) -> float:
    her = (human_overrides / total_decisions) * 100 if total_decisions > 0 else 0.0

    # Push to your metrics store
    metrics.gauge(
        "agent.human_escalation_rate",
        her,
        tags=[f"agent_id:{agent_id}"],
    )

    # Alert if above threshold
    if her > THRESHOLD:
        alert_oncall_owner(agent_id, her)

    return her
```
When HER exceeds your threshold, a named human gets paged. Not a team. Not a Slack channel. A person.
Three Requirements Before Any Agent Goes to Production
Based on everything I've observed and learned, here's what I consider non-negotiable.
- A Named Human Owner Who Gets Paged. The ownership model matters more than the tooling. Every agent must have a named individual who is accountable when HER exceeds threshold. Shared ownership is no ownership. "The AI team owns it" means nobody owns it. Write it down:

```yaml
agent:
  name: document-processor-v2
  owner: ajay.devineni@company.com
  pager: +1-xxx-xxx-xxxx
  slack_handle: "@ajay"
  escalation_policy: p1-sre-rotation
```
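Ownership can be enforced mechanically at deploy time: refuse to ship any agent whose owner field is a team alias rather than an individual. A minimal sketch, assuming the config has already been parsed into a dict with the field names shown above (the alias patterns are illustrative):

```python
# Hypothetical deploy-time gate: an agent config must name an individual
# human owner; team aliases are rejected outright.
REQUIRED_FIELDS = ("name", "owner", "pager", "escalation_policy")
TEAM_ALIASES = ("team@", "ai-team@", "group@", "all@")  # assumed alias patterns

def validate_ownership(agent_config: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not agent_config.get(field):
            errors.append(f"missing required field: {field}")
    owner = agent_config.get("owner", "")
    if any(owner.startswith(alias) for alias in TEAM_ALIASES):
        errors.append("owner must be a named individual, not a team alias")
    return errors
```

Wire this into CI so a config with `owner: ai-team@company.com` fails the build before the agent ever ships.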
- A Runbook That Covers At Least Four Failure Modes. Before any agent ships, a runbook must exist. Minimum coverage:

| Failure mode | What to look for | Immediate action |
|---|---|---|
| Tool failure | Tool error rate spikes | Check dependency health, assess in-flight tasks |
| Context degradation | Output length increases, HER spikes | Inspect conversation history, roll back prompt |
| Prompt drift | Behavioral baseline deviation | Freeze deploys, compare prompt versions |
| Blast radius event | Agent operating outside defined scope | Invoke kill switch, audit side effects |

A runbook doesn't need to be 20 pages. It needs to be right and reachable at 2am.
- A 30-Day Behavioral Baseline Before Any SLO Is Set. This is the one most teams skip because it feels slow. You cannot commit to reliability you have not measured. Run your agent in shadow mode for 30 days — processing real inputs, generating real outputs, but reviewed before action. During that window, measure everything:
- Task completion rate
- Human escalation rate (baseline HER)
- Tool call accuracy
- Decision latency (p50/p95/p99)
- Context window utilization
- Output quality score variance across identical inputs
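The latency percentiles in that list can be computed straight from shadow-mode logs with no extra tooling. A minimal sketch using a nearest-rank percentile over raw samples (function names are illustrative):

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over a list of raw samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Index of the value nearest to the pct-th percentile position
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[rank]

def latency_baseline(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize shadow-mode decision latencies as p50/p95/p99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Run it over each day of the shadow window; the day-to-day spread of p95 tells you how much headroom your eventual SLO target needs.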
Only after 30 days do you write an SLO. The baseline IS the SLO foundation.
```yaml
# Example SLO written after baseline
agent_slo:
  valid_from: "after-30d-baseline"
  objectives:
    - metric: task_completion_rate
      target: 99.2%
      baseline_observed: 99.6%  # headroom built in intentionally
    - metric: human_escalation_rate
      target: "< 3%"
      baseline_observed: 1.8%
      alert_threshold: 5%
```
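Once the SLO exists, evaluating it is mechanical. A minimal sketch against the example objectives above, with the thresholds hardcoded for illustration (a real evaluator would parse them from the SLO file):

```python
# Targets copied from the example SLO above; in production, load these
# from the agent_slo config rather than hardcoding.
SLO = {
    "task_completion_rate": {"target_min": 99.2},
    "human_escalation_rate": {"target_max": 3.0, "alert_threshold": 5.0},
}

def evaluate_slo(observed: dict) -> dict:
    """Return per-metric status: 'ok', 'slo_breach', or 'page'."""
    status = {}
    tcr = observed["task_completion_rate"]
    status["task_completion_rate"] = (
        "ok" if tcr >= SLO["task_completion_rate"]["target_min"] else "slo_breach"
    )
    her = observed["human_escalation_rate"]
    if her > SLO["human_escalation_rate"]["alert_threshold"]:
        status["human_escalation_rate"] = "page"  # wake the named owner
    elif her > SLO["human_escalation_rate"]["target_max"]:
        status["human_escalation_rate"] = "slo_breach"
    else:
        status["human_escalation_rate"] = "ok"
    return status
```

Note the two-tier split: crossing the SLO target burns error budget; crossing the alert threshold pages the named owner.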
The Kill Switch Pattern
Every production agent needs a kill switch — a mechanism to halt execution immediately, without a code deployment.
```python
def check_kill_switch(agent_id: str) -> bool:
    """
    Checks a config store for kill switch status.
    Works with SSM Parameter Store, LaunchDarkly,
    or any feature flag system.
    """
    status = config_store.get(f"agents/{agent_id}/kill-switch")
    return status == "ACTIVE"


def agent_task_loop(agent_id: str, tasks: list):
    for task in tasks:
        # Check before EVERY decision, not just at startup
        if check_kill_switch(agent_id):
            log_halt(agent_id, task)
            raise AgentHaltException("Kill switch active")
        execute(task)
```
The kill switch should be:
- Flippable without a deployment (config store, not code)
- Checked before every decision, not just at startup
- Audited — log every check and every activation
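The audit requirement is easy to see with an in-memory stand-in for the config store. In production this would be SSM Parameter Store or a feature flag service, but the shape is the same (the class and method names here are assumptions, not a real library API):

```python
import time

class AuditedKillSwitchStore:
    """In-memory sketch of a config store that logs every check and flip."""

    def __init__(self):
        self._values = {}
        self.audit_log = []  # (timestamp, operation, detail) tuples

    def set(self, key: str, value: str) -> None:
        # Flipping the switch is a config write — no deployment involved
        self._values[key] = value
        self.audit_log.append((time.time(), "SET", f"{key}={value}"))

    def get(self, key: str):
        # Every check is recorded, so postmortems can reconstruct
        # exactly when the agent last consulted the switch
        value = self._values.get(key)
        self.audit_log.append((time.time(), "GET", f"{key}={value}"))
        return value

store = AuditedKillSwitchStore()
store.set("agents/doc-processor-v2/kill-switch", "ACTIVE")
halted = store.get("agents/doc-processor-v2/kill-switch") == "ACTIVE"
```

Test the flip in a game day, not during an incident: the first time you activate the kill switch should never be the time it matters.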
What the Observability Stack Actually Looks Like
```
Agent Runtime
 │
 ├──▶ Structured logs (JSON, one entry per decision)
 │      └── Fields: session_id, tool_calls, human_override, confidence, latency
 │
 ├──▶ Custom metrics
 │      └── HER, tool error rate, context utilization, decision latency
 │
 ├──▶ Distributed traces
 │      └── End-to-end: input → LLM → tool calls → output
 │
 ├──▶ Event stream (one event per agent decision)
 │      └── Powers alerting rules and downstream audit
 │
 └──▶ Decision audit log (immutable)
        └── S3 / blob store, retained for postmortem analysis
```
Every agent decision should emit a structured log entry:
```json
{
  "timestamp": "2025-01-15T14:23:01Z",
  "agent_id": "doc-processor-v2",
  "session_id": "sess_abc123",
  "tools_called": ["search", "summarize"],
  "tool_success": [true, true],
  "human_override": false,
  "context_utilization_pct": 47.1,
  "latency_ms": 3420,
  "task_completed": true
}
```
This is your audit trail. This is what you bring to a postmortem.
The Postmortem Question Nobody Asks
After an incident with a traditional service, postmortems ask:
- What failed?
- Why did it fail?
- How do we prevent recurrence?
For AI agents, there's a fourth question that almost nobody asks:
Was there a window where the agent was wrong, and we didn't know?
Silent degradation periods are invisible in traditional postmortems because the dashboards were green. Adding a behavioral baseline comparison to every postmortem template forces this question into the open.
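That comparison can be automated: bucket decisions by hour, compute HER per bucket, and flag any window where it ran materially above the 30-day baseline. A minimal sketch (the 2x multiplier is an assumption — tune it to your baseline's observed variance):

```python
def silent_windows(buckets, baseline_her: float, factor: float = 2.0) -> list:
    """
    buckets: per-hour (human_overrides, total_decisions) counts.
    Returns indices of hours where HER exceeded factor * baseline_her —
    candidate windows where the agent was wrong and nobody knew.
    """
    flagged = []
    for i, (overrides, total) in enumerate(buckets):
        if total == 0:
            continue  # no decisions that hour, nothing to judge
        her = overrides / total * 100
        if her > factor * baseline_her:
            flagged.append(i)
    return flagged
```

Run this over the incident's full timeline during the postmortem; any flagged hours before the first alert fired are your silent degradation window.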
Is Your Agent Production-Ready or Demo-Ready?
The SRE community spent 20 years learning how to operate distributed systems reliably. Those lessons — ownership, observability, SLOs, runbooks, postmortems — weren't invented in conference rooms. They were earned through outages.
AI agents are distributed systems with an additional dimension of unpredictability: they make decisions.
Before your next agent ships, run this checklist:
- [ ] Named human owner with pager assigned
- [ ] Runbook covering tool failure, context degradation, prompt drift, blast radius
- [ ] HER metric instrumented and alerting
- [ ] Kill switch implemented and tested
- [ ] 30-day shadow mode baseline completed
- [ ] SLO written and derived from baseline data
- [ ] Postmortem template updated to include behavioral baseline comparison
If any box is unchecked, your agent is demo-ready. Not production-ready.
Author: Ajay Devineni | Connect on LinkedIn




