I've been running Qwen3.6-35B-A3B locally in pi agent and hit a cat-spam problem. The agent just ignores the read tool, and the model gets stuck reading the same file 3-4 times with cat, or dumping entire 2k-line logs instead of grepping.
I wrote a custom tool as a replacement. It feels like it helped: the agent makes fewer calls, doesn't re-read the same file blindly, and tasks seem to finish faster.
But I have zero objective way to know if it's actually better.
Maybe I'm just cherry-picking the tasks where it works.
So I'm curious: how do you test whether your tool set is genuinely improving things? Do you write benchmarks?
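
To make the question concrete, here is a minimal sketch of what an objective check could look like. It assumes each run is logged as a list of (tool_name, argument) tuples and that the same fixed task set is run once with the default tools and once with the custom tool. The log format and the names `run_metrics` and `compare` are hypothetical, not part of pi agent or any real API.

```python
from collections import Counter
from statistics import mean


def run_metrics(tool_calls):
    """Summarize one agent run.

    tool_calls: list of (tool_name, argument) tuples in call order.
    This is a hypothetical log format, e.g. ("bash", "cat src/main.py").
    """
    counts = Counter(tool_calls)
    # A repeat is any call identical to an earlier call in the same run.
    repeats = sum(n - 1 for n in counts.values())
    return {"calls": len(tool_calls), "repeats": repeats}


def compare(baseline_runs, custom_runs):
    """Paired comparison over tasks present in both dicts.

    Each dict maps task_id -> tool_calls for that task.
    Returns the mean per-task difference (custom minus baseline),
    so negative numbers mean the custom tool set used fewer calls.
    Raises statistics.StatisticsError if no task is shared.
    """
    shared = sorted(set(baseline_runs) & set(custom_runs))
    diffs = []
    for task in shared:
        b = run_metrics(baseline_runs[task])
        c = run_metrics(custom_runs[task])
        diffs.append({key: c[key] - b[key] for key in b})
    return {key: mean(d[key] for d in diffs) for key in ("calls", "repeats")}
```

Fixing the task list before running anything is what guards against cherry-picking: every task counts, including the ones where the custom tool does worse. Task success would still need to be checked separately, since fewer calls on a failed task isn't an improvement.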