How do you test AI agents in production? The unpredictability is overwhelming.[D]

Reddit r/MachineLearning / 4/27/2026


Key Points

  • The author describes moving from traditional QA (asserting a deterministic output) to testing a production LLM-based agent where results and reasoning paths are non-deterministic across runs.
  • They explain why common approaches like snapshot testing, regex/keyword matching, and manual human evaluation are inadequate for catching reasoning mistakes and scaling reliably.
  • They note that scoring-rubric evaluations help somewhat but still lack clear, defensible pass/fail thresholds for production readiness.
  • The author is looking for an integration-test-like method that can verify intermediate reasoning/tool-use steps without overfitting expected outputs or relying on an LLM judge that could add its own failure modes.
  • Because the agent is embedded in a real product, incorrect decisions have concrete consequences, making rigorous, automated verification especially urgent.

I’ve been in QA for almost a decade. My mental model for quality was always: given input X, assert output Y. Now I’m on a team that’s shipping an LLM-based agent that handles multi-step tasks. I genuinely do not know how to test this in a way that feels rigorous.

The thing works. But the output isn’t deterministic. The same input can produce different reasoning chains across runs. Hell, even with temp=0 I see variation in tool selection and intermediate steps. My normal instincts don’t map here. I can’t write an assertion and run it a thousand times to track flakiness. I’m at a loss for what to do.
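To make “track flakiness” concrete, here’s roughly the harness I keep reaching for out of habit (run_agent and the step-dict trace format are stand-ins for our internal entry point, not any real library):

```python
from collections import Counter

def tool_sequence(trace):
    """Extract just the ordered tool names from one agent run's trace."""
    return tuple(step["tool"] for step in trace if step.get("tool"))

def measure_flakiness(run_agent, prompt, n=50):
    """Re-run the same prompt n times and count distinct tool-call sequences.

    run_agent(prompt) is assumed to return a list of step dicts like
    {"tool": "search", "input": ..., "output": ...} -- a placeholder for
    whatever trace format your agent framework actually emits.
    """
    sequences = Counter(tool_sequence(run_agent(prompt)) for _ in range(n))
    modal_path, count = sequences.most_common(1)[0]
    return {
        "distinct_paths": len(sequences),
        "modal_path": modal_path,
        "modal_path_rate": count / n,
    }
```

Even that only tells me how much the paths vary across runs, not whether any given path is actually wrong.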

Snapshot testing on final outputs is too brittle: a correct response worded differently breaks the test. Regex/keyword matching on outputs misses reasoning errors that accidentally land on the right answer. Human eval isn’t automatable and doesn’t scale. Evals with a scoring rubric almost work, but I don’t have a defensible way to set pass/fail thresholds.
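For the rubric approach, what I have today is roughly this: score each case and gate the suite on a pass rate, where score_fn and both thresholds are placeholders I made up rather than anything principled:

```python
def evaluate_suite(run_agent, cases, score_fn,
                   case_threshold=0.7, suite_threshold=0.9):
    """Rubric-style eval: score each case, then gate the suite on a pass rate.

    cases: list of {"prompt": ..., "reference": ...} dicts (our eval set).
    score_fn(output, reference) -> float in [0, 1]; a placeholder for
    whatever rubric scoring you use (keyword checks, embedding similarity,
    human labels, etc.).
    Both thresholds are the arbitrary part -- I have no principled way to
    argue that 0.7 or 0.9 means "production ready".
    """
    passes = 0
    for case in cases:
        output = run_agent(case["prompt"])
        if score_fn(output, case["reference"]) >= case_threshold:
            passes += 1
    pass_rate = passes / len(cases)
    return {"pass_rate": pass_rate, "suite_passed": pass_rate >= suite_threshold}
```

The numbers work mechanically, but I can’t defend the thresholds to anyone when they ask why that counts as ready to ship.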

I want something conceptually equivalent to integration tests for reasoning steps. Like, given this tool result, does the next step correctly incorporate it? I don’t know how to make that assertion without either hardcoding expected outputs or using another LLM as a judge, which would introduce a new failure mode into my test suite.
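The closest thing I can picture is a structural check on the trace: assert that a value a tool returned actually shows up in some later step, without pinning the exact wording (the step-dict format is again a stand-in for whatever the framework logs):

```python
def assert_tool_result_used(trace, tool_name, key):
    """Integration-test-style check on one reasoning hop.

    trace: list of step dicts like {"tool": ..., "input": ..., "output": ...},
    where "output" is assumed to be a dict -- adapt to your framework's format.
    Verifies that the value the named tool returned under `key` appears in a
    later step's input/output or the final answer, rather than asserting on
    the exact wording of the response.
    """
    for i, step in enumerate(trace):
        if step.get("tool") == tool_name:
            value = str(step["output"][key])
            later_text = " ".join(
                str(s.get("input", "")) + str(s.get("output", ""))
                for s in trace[i + 1:]
            )
            assert value in later_text, (
                f"{tool_name} returned {key}={value!r} but no later step used it"
            )
            return
    raise AssertionError(f"agent never called {tool_name}")
```

But that only catches “the agent ignored the tool result”, not subtler reasoning errors, and substring matching is crude.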

The agent runs inside our product. There are real users and actual consequences when it makes a bad call.

Is there a framework that allows for verification of agentic reasoning?

submitted by /u/this_aint_taliya