How do you objectively tell if your custom agent tools are actually better?

Reddit r/LocalLLaMA / 4/29/2026


Key Points

  • The author reports that running Qwen3.6-35B-A3B locally in a “pi agent” led to tool-use failures, including repeated `cat` calls that cause the agent to get stuck or dump large logs instead of performing targeted greps.
  • After writing custom replacement tools, they subjectively feel improvements: fewer tool calls, less blind re-reading of the same files, and faster task completion.
  • The core problem is the lack of an objective way to determine whether the new tool set is genuinely better, or whether the apparent improvement is an artifact of cherry-picked successful tasks.
  • They ask how others test or benchmark custom agent tool sets to measure genuine improvement.
  • The post implicitly highlights the need for evaluation criteria and experimental design (e.g., repeatable tests, metrics, and controls) when assessing agent tool performance.
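
To ground the "metrics" point: the symptoms described above (repeat reads, call volume) are cheap to measure with a thin wrapper around tool dispatch. A minimal sketch in Python, assuming tools are plain callables keyed by name; the `ToolCallLogger` class and the `cat`/`read` tool names are illustrative, not pi agent's actual API:

```python
import time
from collections import Counter

class ToolCallLogger:
    """Thin wrapper that counts tool calls and repeated file reads.

    Illustrative only: assumes tools are plain callables keyed by name,
    and that read-style tools take the file path as their first argument.
    """

    def __init__(self, tools):
        self.tools = tools
        self.calls = Counter()        # tool name -> number of invocations
        self.file_reads = Counter()   # file path -> times read (re-read detector)
        self.started = time.monotonic()

    def call(self, name, *args, **kwargs):
        self.calls[name] += 1
        if name in ("cat", "read") and args:
            self.file_reads[args[0]] += 1
        return self.tools[name](*args, **kwargs)

    def summary(self):
        return {
            "total_calls": sum(self.calls.values()),
            "repeat_reads": sum(n - 1 for n in self.file_reads.values() if n > 1),
            "elapsed_s": round(time.monotonic() - self.started, 1),
        }
```

Per-task summaries like this turn "feels faster" into numbers that can be compared across tool sets.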

I've been running Qwen3.6-35B-A3B locally in pi agent and hit the `cat` spam problem: the agent ignores the read tool, and the model gets stuck reading the same file 3-4 times with `cat` or dumps entire 2k-line logs instead of grepping.

I wrote custom replacement tools, and it feels like they helped. The agent makes fewer calls, doesn't re-read the same file blindly, and tasks seem to finish faster.

But I have zero objective way to know if it's actually better.

Maybe I'm just cherry-picking the tasks where it works.

So I'm curious — how do you test whether your tool set is genuinely improving things? Do you write benchmarks?
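
Rough sketch of the kind of harness I'm imagining, in case it makes the question clearer: freeze a small task suite from sessions where the stock tools misbehaved, run both tool sets on it with identical prompts, and repeat each task a few times since single runs aren't deterministic. Everything here (`run_task`, the task strings, the metric names) is a placeholder, not working code I have:

```python
import statistics

# Hypothetical task suite: in practice I'd freeze tasks from real sessions
# where the stock tools misbehaved (log digging, repeated file reads, etc.).
TASKS = [
    "find the error in logs/build.log and name the failing module",
    "locate where MAX_RETRIES is defined and report its value",
]

def benchmark(run_task, toolset, repeats=5):
    """run_task(task, toolset) drives one agent session and returns
    {"success": bool, "tool_calls": int, "seconds": float}.
    Repeats matter: single runs at temperature > 0 are mostly noise."""
    results = [run_task(t, toolset) for t in TASKS for _ in range(repeats)]
    return {
        "success_rate": sum(r["success"] for r in results) / len(results),
        "median_calls": statistics.median(r["tool_calls"] for r in results),
        "median_seconds": statistics.median(r["seconds"] for r in results),
    }

# Identical suite, identical prompts, only the tool set differs:
#   print(benchmark(my_run_task, toolset="default"))
#   print(benchmark(my_run_task, toolset="custom"))
```

Judging `success` per task is the part I don't have a clean answer for yet, which is really what I'm asking about.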

submitted by /u/Own_Suspect5343