[D] Why evaluating only final outputs is misleading for local LLM agents

Reddit r/MachineLearning / 3/27/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author argues that evaluating only a local LLM agent’s final output can be misleading because an agent may reach a correct answer while using wrong, unnecessary, or risky tools internally.
  • They emphasize that the agent’s decision process—tool selection, step efficiency, loop behavior, and whether the reasoning/traces are sensible—contains more evaluative signal than the end result.
  • They point out that many evaluation setups still focus on final answers and often rely on external API judges, which may miss trace-level issues in local environments.
  • The post describes building a small fully local evaluation workflow for agents that checks expected vs. forbidden tool usage and penalizes loops and extra steps, with Ollama used as the judge.
  • They share a GitHub repo (rubric-eval) and invite discussion on improved trace metrics for local LLM agent evaluation.

Been running local agents with Ollama + LangChain lately and noticed something kind of uncomfortable — you can get a completely correct final answer while the agent is doing absolute nonsense internally.

I’m talking about stuff like calling the wrong tool first and then “recovering,” using tools it didn’t need at all, looping a few times before converging, or even getting dangerously close to calling something it shouldn’t. And if you’re only checking the final output, all of that just… passes.

It made me realize that for agents, the output is almost the least interesting part. The process is where all the signal is.

Like imagine two agents both summarizing a document correctly. One does read → summarize in two clean steps. The other does read → search → read again → summarize → retry. Same result, but one is clearly way more efficient and way less risky. If you’re not looking at the trace, you’d treat them as equal.
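
To make that concrete, here's roughly how I picture those two runs once you log them as traces (tool names are made up, nothing tied to a specific framework):

```python
# Two hypothetical traces that both end in a correct summary.
trace_clean = ["read", "summarize"]
trace_messy = ["read", "search", "read", "summarize", "summarize"]  # the retry shows up as a repeat

# An output-only eval treats these as identical. The trace tells a different story:
for name, trace in [("clean", trace_clean), ("messy", trace_messy)]:
    print(name, "steps:", len(trace), "tools touched:", sorted(set(trace)))
```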

So I started thinking about what actually matters to evaluate for local setups. Stuff like whether the agent picked the right tools, whether it avoided tools it shouldn’t touch, how many steps it took, whether it got stuck in loops, and whether the reasoning even makes sense. Basically judging how it got there, not just where it ended up.
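
In code, the checks I care about look something like this. Very much a sketch, not the repo's actual implementation; the penalty weights are arbitrary and the tool names are the made-up ones from the example above:

```python
def check_tools(trace, expected, forbidden):
    """trace is the ordered list of tool names the agent actually called."""
    used = set(trace)
    return {
        "missing_expected": set(expected) - used,
        "forbidden_used": set(forbidden) & used,
    }

def detect_loops(trace, max_cycle=2):
    """Count spots where a cycle of 1..max_cycle tools immediately repeats
    (e.g. summarize -> summarize, or read -> search -> read -> search)."""
    loops = 0
    for size in range(1, max_cycle + 1):
        for i in range(len(trace) - 2 * size + 1):
            if trace[i:i + size] == trace[i + size:i + 2 * size]:
                loops += 1
    return loops

def score_trace(trace, expected, forbidden, max_steps):
    """Crude 0-1 score: start at 1 and knock points off per issue (weights are arbitrary)."""
    tools = check_tools(trace, expected, forbidden)
    penalty = len(tools["missing_expected"]) + 2 * len(tools["forbidden_used"])
    penalty += detect_loops(trace) + max(0, len(trace) - max_steps)
    return round(max(0.0, 1.0 - 0.2 * penalty), 2)

# The two traces from the example above:
print(score_trace(["read", "summarize"], {"read", "summarize"}, {"delete_file"}, max_steps=2))  # 1.0
print(score_trace(["read", "search", "read", "summarize", "summarize"],
                  {"read", "summarize"}, {"delete_file"}, max_steps=2))                         # 0.2
```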

I haven’t seen a lot of people talking about this on the local side specifically. Most eval setups I’ve come across still focus heavily on final answers, or assume you’re fine sending data to an external API for judging.

Curious how people here are handling this. Are you evaluating traces at all, or just outputs? And if you are, what kind of metrics are you using for things like loop detection or tool efficiency?

I actually ran into this enough that I hacked together a small local eval setup for it.

Nothing fancy, but it can:

- check tool usage (expected vs forbidden)

- penalize loops / extra steps

- run fully local (I’m using Ollama as the judge; rough sketch of the judge call below)
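
The judge part is basically one local chat call that grades the trace against a rubric instead of the final answer. Something like this, assuming the `ollama` Python client; the model name and prompt are placeholders, and this isn't a copy-paste of what's in the repo:

```python
import json
import ollama  # pip install ollama; assumes a local Ollama server is running

def judge_trace(trace_text: str, rubric: str, model: str = "llama3.1") -> dict:
    """Ask a local model to grade an agent's trace against a rubric, not its final answer."""
    prompt = (
        "You are grading an agent's tool-use trace, not its final answer.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Trace:\n{trace_text}\n\n"
        'Reply only with JSON like {"score": 0.0, "issues": ["..."]}.'
    )
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        format="json",  # constrain the reply to valid JSON
    )
    return json.loads(resp["message"]["content"])

# e.g. judge_trace("read -> search -> read -> summarize -> summarize",
#                  "Penalize unnecessary tools, loops, and extra steps.")
```

The nice part is that nothing leaves the machine, which was the whole point versus API-based judges.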

If anyone wants to poke at it:

https://github.com/Kareem-Rashed/rubric-eval

Would genuinely love ideas for better trace metrics

submitted by /u/MundaneAlternative47