After 6 months of running AI agents in production I think the framework you pick barely matters. The thing that kills them is something else.

Reddit r/artificial / 5/24/2026


Key Points

  • After running roughly 30 AI agents in production for paying customers over six months, the author argues that the choice of orchestration framework (e.g., LangChain, CrewAI, AutoGen, OpenAI Agents SDK) is mostly a distraction.
  • The biggest production failures come from runtime and system issues—especially infinite loops caused by ambiguous downstream outputs—leading to rapid cost overruns and lack of clarity on which agent caused the spend.
  • Crashes and restarts (such as VPS reboots) expose weak state handling and memory persistence, causing agents to lose in-progress work and forget prior context like ticket histories.
  • Debugging and customer disputes are hindered by missing observability, including the absence of detailed records of what the agent saw, decided, and which tool calls it made.
  • To make agents reliable, the author emphasizes a “real stack” centered on persistent memory, loop detection at the runtime layer, tamper-resistant audit trails, shared memory across collaborating agents, and per-agent cost tracking.

Going to get downvoted for this but here we go. I've been running about 30 agents in production for paying customers for the last 6 months and I'm convinced the framework debate is mostly a distraction.

LangChain, CrewAI, AutoGen, OpenAI Agents SDK. Pick whichever one your team already knows. It doesn't matter as much as you think.

What actually decides whether your agent works in production is something almost nobody talks about on this sub, and it isn't in the framework.

Here's what I've seen kill more agents than every framework bug combined.

The agent gets stuck in a loop. It calls the same tool 200 times in 4 minutes because something downstream returned ambiguous data and the LLM decided to retry forever. Your OpenAI bill goes from $3 a day to $400 in one afternoon. By the time you notice you've burned a grand. You can't even tell which agent did it because there's no audit trail.

Your VPS reboots overnight for kernel patches. Every agent that was mid-task loses everything. Tomorrow morning the support agent has no memory of yesterday's tickets, the research crew has forgotten what they were investigating, the pipeline agent restarts from scratch. None of these are framework problems. They're memory and state problems.

A customer complains the agent gave them wrong info three days ago. You go to debug. There's no record of what the agent saw, what it decided, or which tool calls it made. The framework didn't log that because frameworks aren't observability tools. You shrug and refund.

You scaled to 15 agents working together. Two of them have conflicting beliefs about the same customer because their memory isn't shared. The customer gets two different answers in the same conversation depending on which agent replies first.

Go through this enough times and you realize the part you actually need isn't in the framework at all.

What I think the real stack is.

The framework just orchestrates LLM calls. Use whatever your team likes. It's the cheap layer.

A persistent memory layer that survives crashes, restarts, and redeploys, so the agent has actual continuity. This is the layer that decides whether your agent is a toy or a product.
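To make "survives crashes, restarts, and redeploys" concrete, here is a minimal sketch of a checkpoint store that writes agent state to disk after every step. It assumes SQLite from the Python standard library and JSON-serializable state. `CheckpointStore`, the table name, and the `agent_id` key are hypothetical names for illustration, not any framework's API.

```python
import json
import sqlite3


class CheckpointStore:
    """Keeps the latest state per agent on disk so a reboot does not erase it."""

    def __init__(self, path="agent_state.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "agent_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, agent_id, state):
        # Overwrite the previous checkpoint for this agent and commit right away.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints (agent_id, state) VALUES (?, ?)",
            (agent_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, agent_id):
        # Returns the last saved state, or None if the agent has no checkpoint.
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE agent_id = ?", (agent_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

On startup the agent would call `load()` and resume from the saved step instead of starting from scratch. A real memory layer needs more than this (history, retrieval, concurrent writers), but the restart behavior is the core idea.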

Loop detection at the runtime layer, not bolted on as a wrapper around the framework. Something that catches your agent making the same call too many times in a row and stops it before the bill explodes.
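A rough sketch of what that check could look like, assuming the runtime can see every tool call before it executes. It counts identical calls (same tool, same arguments) inside a sliding window of recent calls, which also catches A-B-A-B style loops and not just strict repeats in a row. `LoopGuard`, `LoopDetected`, and the default thresholds are made up for illustration.

```python
import hashlib
import json
from collections import deque


class LoopDetected(RuntimeError):
    """Raised when an agent repeats the same tool call too often."""


class LoopGuard:
    def __init__(self, max_repeats=5, window=20):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # fingerprints of the latest calls

    def check(self, tool_name, args):
        # Fingerprint the call so identical tool + arguments compare equal.
        key = hashlib.sha256(
            json.dumps([tool_name, args], sort_keys=True, default=str).encode()
        ).hexdigest()
        self.recent.append(key)
        if self.recent.count(key) > self.max_repeats:
            raise LoopDetected(
                f"{tool_name} called more than {self.max_repeats} times "
                f"within the last {len(self.recent)} calls"
            )
```

The runtime would call `guard.check(tool_name, args)` before every tool execution and pause or kill the agent when `LoopDetected` is raised, using one guard per agent run.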

An audit trail of every decision the agent made, with a hash chain so you can prove later what happened when the customer pushes back. Screenshots and logs aren't enough when ten thousand dollars is on the line.
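For anyone who hasn't built one, a hash chain is simple: each log entry stores the hash of the previous entry, so editing or deleting any past entry breaks every hash after it. The sketch below assumes SHA-256 and JSON-serializable events. The function names and entry fields are illustrative, not taken from any particular tool.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def append_entry(log, event):
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify(log):
    """Recompute every hash and return False if any entry was altered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

One caveat: the chain only proves anything if the latest hash is also stored somewhere the person editing the log can't reach, for example a separate system or a periodic export. Otherwise someone with write access can rewrite the whole chain.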

Shared memory between agents in the same team so they're not having different conversations about the same customer.
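One way to keep two agents from overwriting each other's beliefs is a shared store with versioned writes: an agent can only update a fact if it has seen the latest version. This is a minimal in-process sketch assuming all agents run as threads in one process. `SharedMemory` and its methods are hypothetical, and a real multi-process team would need a database or a service doing the same job.

```python
import threading


class SharedMemory:
    """One store for the whole team, with optimistic version checks on writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._facts = {}  # key -> (version, value)

    def read(self, key):
        # Returns (version, value); version 0 means the key was never written.
        with self._lock:
            return self._facts.get(key, (0, None))

    def write(self, key, value, expected_version):
        # Reject the write if another agent updated the key in the meantime.
        with self._lock:
            current_version, _ = self._facts.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._facts[key] = (current_version + 1, value)
            return True
```

An agent whose write is rejected re-reads the key and reconciles before answering, so the customer sees one story instead of two.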

Cost tracking per agent so you actually know which one ran away with your budget.
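A bare-bones version of per-agent cost tracking, assuming you can read input and output token counts from each LLM response. The prices and budget are parameters you supply; nothing here reflects any provider's actual pricing, and `CostTracker` is an illustrative name.

```python
from collections import defaultdict


class CostTracker:
    def __init__(self, price_per_1k_input, price_per_1k_output, budget_usd):
        self.price_in = price_per_1k_input    # USD per 1,000 input tokens (your value)
        self.price_out = price_per_1k_output  # USD per 1,000 output tokens (your value)
        self.budget_usd = budget_usd          # per-agent spend limit
        self.spend = defaultdict(float)       # agent_id -> USD spent so far

    def record(self, agent_id, input_tokens, output_tokens):
        """Add the cost of one LLM call to the agent's running total."""
        cost = (input_tokens / 1000) * self.price_in
        cost += (output_tokens / 1000) * self.price_out
        self.spend[agent_id] += cost
        return self.spend[agent_id]

    def over_budget(self):
        """Return the agents whose spend exceeds the budget, sorted by id."""
        return sorted(a for a, s in self.spend.items() if s > self.budget_usd)
```

Calling `record()` after every LLM response and checking `over_budget()` on a timer is enough to tell you which agent ran away with the budget, and to shut it off automatically if you want.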

When I look at what separates the agents that survived production from the ones that died, it's never that they picked the right framework. It's that they had this layer underneath, either built carefully in-house or borrowed from somewhere.

Full disclosure: I'm building one of these tools. There are others: Mem0, Zep, and Letta in the memory space, Helicone and LangSmith in the observability space. Mix and match. Use one or build your own. Just please stop arguing about whether LangChain or CrewAI is better when the thing eating your production agents has nothing to do with either of them.

What's been your worst production agent failure? Curious what other people have actually hit.

I built a free tool that aims to solve most of these issues. What do you think?

submitted by /u/DetectiveMindless652