Been building AI agents for about a year now, and the thing that always drove me crazy is you deploy an agent, it runs for hours, and you have absolutely no idea what it did. The logs say "task complete" 47 times, but did it actually do 47 different things, or did it just loop the same task over and over? I had an agent burn through about $340 in API credits over a weekend because it got stuck retrying the same request. The logs showed 200 OK on every call. Everything looked fine. It just kept doing the same thing for 6 hours straight while I slept.

So I built something to fix this. It's called Octopoda, and it's basically an observability layer that sits underneath your agents. Every memory write, every decision, every recall gets logged on a timeline. You can literally press play and watch what your agent did at 3am, step by step, like scrubbing through a video.

The part that surprised me most was the loop detection. Once I could see the full timeline, I realised how often agents loop without you knowing. Not obvious infinite loops, subtle stuff: an agent that rewrites the same conclusion 8 times with slightly different wording, or one that keeps checking the same API endpoint every 30 seconds even though the data hasn't changed. Each iteration costs tokens but produces nothing new. We track 5 signals for this: write similarity, key overwrite frequency, velocity spikes, alert frequency, and goal drift. When enough signals fire together, it flags the loop and estimates how much money it's costing you per hour. One user had a research agent that was wasting about $10 an hour on duplicate writes before the detection caught it.

It also does auto-checkpoints. Every 25 writes it saves a snapshot automatically, so if something goes wrong you can roll back to any point with one click. No more losing an entire night of agent work because something corrupted at 4am. Works with LangChain, CrewAI, AutoGen, and the OpenAI Agents SDK.
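The post doesn't show how the signal-based loop detection is implemented. As a rough sketch of the idea (not Octopoda's actual code), here are two of the five signals, write similarity and key overwrite frequency, with made-up thresholds, flagging a loop only when both fire together:

```python
from collections import Counter
from difflib import SequenceMatcher

def detect_loop(writes, sim_threshold=0.85, min_signals=2):
    """Flag a probable loop when multiple signals fire together.
    `writes` is a chronological list of (key, value) memory writes.
    Only two of the five signals are sketched here; all thresholds
    are illustrative assumptions, not Octopoda's real values."""
    signals = []

    # Signal 1: consecutive writes that are near-duplicates of each other
    near_dupes = sum(
        1 for (_, a), (_, b) in zip(writes, writes[1:])
        if SequenceMatcher(None, a, b).ratio() >= sim_threshold
    )
    if near_dupes >= 2:
        signals.append("write_similarity")

    # Signal 2: the same key being overwritten again and again
    key_counts = Counter(key for key, _ in writes)
    if key_counts and max(key_counts.values()) >= 4:
        signals.append("key_overwrite")

    # Flag only when enough signals agree, to cut false positives
    return len(signals) >= min_signals, signals

# An agent rewriting the same conclusion with minor wording changes:
looping, fired = detect_loop([
    ("conclusion", "The market is trending up."),
    ("conclusion", "The market is trending up!"),
    ("conclusion", "The market is trending upward."),
    ("conclusion", "The market is trending up."),
])
```

Requiring several weak signals to agree is what lets this catch the subtle "rewrites the same conclusion 8 times" case without flagging an agent that legitimately revisits a key once or twice.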
Integration is one line. The dashboard shows everything in real time: agent health scores, cost per agent, shared memory between agents, and a full audit trail with the reasoning behind every decision. Honestly, the most useful thing is just being able to answer "what happened overnight" without spending an hour reading logs.

Anyone else dealing with the "I have no idea what my agent did" problem? Curious how other people are handling observability for autonomous workflows. Let me know if anyone wants to check it out!
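The post doesn't show how the auto-checkpointing works internally. A minimal sketch of the snapshot-every-N-writes idea, with all class and method names invented for illustration:

```python
import copy

class CheckpointedMemory:
    """Illustrative sketch (not Octopoda's API): a key-value memory
    that snapshots itself every `every` writes so any checkpoint can
    be rolled back to later."""

    def __init__(self, every=25):
        self.every = every
        self.store = {}
        self.writes = 0
        self.checkpoints = []  # list of (write_count, snapshot)

    def write(self, key, value):
        self.store[key] = value
        self.writes += 1
        # Auto-checkpoint: deep-copy the state every `every` writes
        if self.writes % self.every == 0:
            self.checkpoints.append((self.writes, copy.deepcopy(self.store)))

    def rollback(self, index):
        """Restore memory to the state captured at checkpoint `index`."""
        self.writes, snapshot = self.checkpoints[index]
        self.store = copy.deepcopy(snapshot)

mem = CheckpointedMemory(every=25)
for i in range(60):
    mem.write(f"step_{i}", f"result {i}")
mem.rollback(0)  # back to the state right after write 25
```

The deep copies are what make rollback safe here: each checkpoint is an independent frozen state, so corrupting the live store after 4am never touches the snapshots.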
I tracked what AI agents actually do when nobody's watching. Built a tool that replays every decision.
Reddit r/artificial / 4/15/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- The author describes a common pain point with AI agents: after they run for hours, developers often cannot tell what specific actions or decisions were taken beyond generic “task complete” logs.
- They built “Octopoda,” an observability layer that records agent memory writes, decisions, and recalls on a replayable timeline so users can scrub through an agent’s behavior step by step.
- Octopoda adds loop detection by tracking multiple signals (e.g., write similarity, key overwrite frequency, velocity spikes, alert frequency, and goal drift) and estimates the hourly cost of looping behaviors.
- The tool includes auto-checkpoints that save snapshots every 25 writes, enabling quick rollback if the agent state becomes corrupted.
- It integrates with popular agent frameworks (LangChain, CrewAI, AutoGen, and the OpenAI Agents SDK) and provides a real-time dashboard with health scores, cost per agent, shared memory views, and a full audit trail.
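The post says the loop detector "estimates how much money the loop is costing you per hour" but doesn't give the formula. A back-of-envelope version, with all input figures assumed for illustration:

```python
def loop_cost_per_hour(duplicate_writes_per_hour, tokens_per_write,
                       usd_per_1k_tokens):
    """Rough hourly cost of a loop: wasted writes x tokens per write
    x price per token. All inputs here are assumptions, not measured
    values from the tool."""
    return duplicate_writes_per_hour * tokens_per_write * usd_per_1k_tokens / 1000

# e.g. 200 duplicate writes/hour at 2,000 tokens each, $0.025 per 1K tokens
cost = loop_cost_per_hour(200, 2000, 0.025)  # roughly $10/hour under these assumptions
```

Plausible figures like these land near the $10/hour waste the post attributes to one user's research agent, which is why even "subtle" loops are worth surfacing.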