How do you objectively tell if your custom agent tools are actually better?

Reddit r/LocalLLaMA / 4/29/2026


Key Points

  • The author reports that running Qwen3.6-35B-A3B locally in a “pi agent” led to tool-use failures, including repeated `cat` calls that cause the agent to get stuck or dump large logs instead of performing targeted greps.
  • After writing custom replacement tools, they subjectively feel improvements: fewer tool calls, less blind re-reading of the same files, and faster task completion.
  • The core problem is the lack of an objective way to determine whether the new tool set is genuinely better, or whether the apparent improvement is an artifact of cherry-picked successful tasks.
  • They ask how others test or benchmark custom agent tool sets to measure genuine improvement.
  • The post implicitly highlights the need for evaluation criteria and experimental design (e.g., repeatable tests, metrics, and controls) when assessing agent tool performance.
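
To ground the "metrics" point: the symptoms described above (repeat reads, call volume) are cheap to measure with a thin wrapper around tool dispatch. A minimal sketch in Python, assuming tools are plain callables keyed by name; the `ToolCallLogger` class and the `cat`/`read` tool names are illustrative, not pi agent's actual API:

```python
import time
from collections import Counter

class ToolCallLogger:
    """Thin wrapper that counts tool calls and repeated file reads.

    Illustrative only: assumes tools are plain callables keyed by name,
    and that read-style tools take the file path as their first argument.
    """

    def __init__(self, tools):
        self.tools = tools
        self.calls = Counter()        # tool name -> number of invocations
        self.file_reads = Counter()   # file path -> times read (re-read detector)
        self.started = time.monotonic()

    def call(self, name, *args, **kwargs):
        self.calls[name] += 1
        if name in ("cat", "read") and args:
            self.file_reads[args[0]] += 1
        return self.tools[name](*args, **kwargs)

    def summary(self):
        return {
            "total_calls": sum(self.calls.values()),
            "repeat_reads": sum(n - 1 for n in self.file_reads.values() if n > 1),
            "elapsed_s": round(time.monotonic() - self.started, 1),
        }
```

Per-task summaries like this turn "feels faster" into numbers that can be compared across tool sets.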

I've been running Qwen3.6-35B-A3B locally in pi agent and hit the `cat` spam problem: the agent ignores the read tool, and the model gets stuck reading the same file 3-4 times with `cat` or dumps entire 2k-line logs instead of grepping.

I wrote custom replacement tools, and it feels like they helped. The agent makes fewer calls, doesn't re-read the same file blindly, and tasks seem to finish faster.

But I have zero objective way to know if it's actually better.

Maybe I'm just cherry-picking the tasks where it works.

So I'm curious — how do you test whether your tool set is genuinely improving things? Do you write benchmarks?
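
Rough sketch of the kind of harness I'm imagining, in case it makes the question clearer: freeze a small task suite from sessions where the stock tools misbehaved, run both tool sets on it with identical prompts, and repeat each task a few times since single runs aren't deterministic. Everything here (`run_task`, the task strings, the metric names) is a placeholder, not working code I have:

```python
import statistics

# Hypothetical task suite: in practice I'd freeze tasks from real sessions
# where the stock tools misbehaved (log digging, repeated file reads, etc.).
TASKS = [
    "find the error in logs/build.log and name the failing module",
    "locate where MAX_RETRIES is defined and report its value",
]

def benchmark(run_task, toolset, repeats=5):
    """run_task(task, toolset) drives one agent session and returns
    {"success": bool, "tool_calls": int, "seconds": float}.
    Repeats matter: single runs at temperature > 0 are mostly noise."""
    results = [run_task(t, toolset) for t in TASKS for _ in range(repeats)]
    return {
        "success_rate": sum(r["success"] for r in results) / len(results),
        "median_calls": statistics.median(r["tool_calls"] for r in results),
        "median_seconds": statistics.median(r["seconds"] for r in results),
    }

# Identical suite, identical prompts, only the tool set differs:
#   print(benchmark(my_run_task, toolset="default"))
#   print(benchmark(my_run_task, toolset="custom"))
```

Judging `success` per task is the part I don't have a clean answer for yet, which is really what I'm asking about.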

submitted by /u/Own_Suspect5343