How do you tell users your AI agent is down?

Reddit r/artificial / 3/26/2026


Key Points

  • The post raises a production-reliability question: how to communicate to users when an AI agent fails (e.g., at 3 AM), since traditional status pages only cover HTTP endpoints, not model or agent-specific failures like timeouts, latency, reasoning loops, or context-limit issues.
  • It argues that “partial outage” messaging is often insufficient for agent outages, especially when the underlying problem is within the model provider or RAG pipeline rather than a web service endpoint.
  • The author describes experimenting with the idea of having agents self-manage a status page by monitoring agent workflows such as email processing, task execution, and code deployments.
  • When the monitoring detects a failure, it automatically creates an incident through an API and resolves it on recovery. The author then asks whether others give end users visibility into agent health or rely on internal alerting only.

Serious question. If you're running an agent in production (customer support bot, coding assistant, data pipeline), what happens when it breaks at 3 AM?

Traditional status pages track HTTP endpoints. They don't understand model providers, agent latency, reasoning loops, or context limits. "Partial outage" doesn't tell your users anything when the real problem is GPT-5.4 timing out or your RAG pipeline choking.
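To make the gap concrete, here is a minimal sketch of what an agent-aware health check might look at, beyond "did the endpoint return 200". Everything in it is an assumption for illustration: the `AgentRun` fields, the thresholds, and the message wording are hypothetical and not taken from any particular status page product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRun:
    """One agent invocation. All fields are hypothetical, for illustration only."""
    provider_timed_out: bool = False
    latency_s: float = 0.0
    steps: int = 0
    prompt_tokens: int = 0


# Illustrative thresholds, not recommendations; tune them for your own agent.
MAX_LATENCY_S = 30.0
MAX_STEPS = 25
CONTEXT_LIMIT_TOKENS = 128_000


def diagnose(run: AgentRun) -> Optional[str]:
    """Return a user-facing message for the first problem found, or None if healthy."""
    if run.provider_timed_out:
        return "Model provider is timing out; responses may fail."
    if run.prompt_tokens >= CONTEXT_LIMIT_TOKENS:
        return "Request exceeded the model's context limit."
    if run.steps >= MAX_STEPS:
        return "Agent appears stuck in a reasoning loop."
    if run.latency_s > MAX_LATENCY_S:
        return "Agent responses are slower than usual."
    return None
```

The point of the sketch is only that each failure mode maps to a message a user can act on, which a generic "partial outage" label does not give them.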

I’m currently exploring letting an agent self-manage its own status page. I haven't seen another status page do this, and I’m hooked.

I use it to monitor the agent itself: it tracks email processing, task execution, and code deployments. When it detects a failure, it creates an incident via the API, then resolves that incident once the agent recovers.
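A rough sketch of that create-on-failure, resolve-on-recovery loop might look like the following. The post does not name the API, so `StatusPageClient`, `create_incident`, `resolve_incident`, and `InMemoryClient` are hypothetical stand-ins, and each health check is assumed to return an error string on failure or `None` when healthy.

```python
from typing import Callable, Dict, List, Optional, Protocol, Tuple


class StatusPageClient(Protocol):
    """Hypothetical client interface; a real status page API will differ."""
    def create_incident(self, component: str, message: str) -> str: ...
    def resolve_incident(self, incident_id: str) -> None: ...


class AgentStatusMonitor:
    """Opens one incident per failing component and resolves it on recovery."""

    def __init__(self, client: StatusPageClient,
                 checks: Dict[str, Callable[[], Optional[str]]]) -> None:
        self.client = client
        self.checks = checks
        self.open_incidents: Dict[str, str] = {}  # component -> incident id

    def poll(self) -> None:
        for component, check in self.checks.items():
            try:
                error = check()
            except Exception as exc:  # a crashing check counts as a failure
                error = f"health check raised: {exc}"
            incident_id = self.open_incidents.get(component)
            if error and incident_id is None:
                # New failure: open an incident and remember its id.
                self.open_incidents[component] = self.client.create_incident(
                    component, error)
            elif not error and incident_id is not None:
                # Recovered: resolve the incident we opened earlier.
                self.client.resolve_incident(incident_id)
                del self.open_incidents[component]


class InMemoryClient:
    """Stand-in for a real API so the sketch runs without network access."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, ...]] = []
        self._count = 0

    def create_incident(self, component: str, message: str) -> str:
        self._count += 1
        incident_id = f"inc-{self._count}"
        self.log.append(("create", incident_id, component, message))
        return incident_id

    def resolve_incident(self, incident_id: str) -> None:
        self.log.append(("resolve", incident_id))
```

Tracking open incidents per component keeps repeated failed polls from opening duplicates, so calling `poll()` on a timer is safe. In practice the checks would be keyed to workflows like `"email_processing"`, `"task_execution"`, and `"code_deployment"`.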

How are you all handling this? Internal alerting only, or do your end users get visibility into agent health?

submitted by /u/codenamev