I've been working on my own chat application for a while now to experiment with LLMs and get some experience with SSE. Also, it's fun to see if I can mirror the functionality offered in "the big boy tools" like Claude Code, Copilot, ...
A while ago, Cloudflare released a blog post about Code Mode: a new and supposedly better way of letting LLMs call tools. Instead of emitting one JSON tool call at a time, the model writes a small piece of code that calls the tools directly. (They specifically use it for MCPs; my app provides these tools as built-ins, but it's basically the same thing at the end of the day.)
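To make the difference concrete, here's a rough sketch. The tool names and the `tools` object are hypothetical (this is not Cloudflare's actual API, just the general shape): with classic tool calling the model emits one JSON call per step and every intermediate result round-trips through the context window, while with code mode the model writes a script that chains the tools itself.

```javascript
// Classic tool calling: the model emits something like
//   {"name": "search_files", "arguments": {"query": "TODO"}}
// then waits for the result before the next call.
//
// Code mode: the model writes a script like this instead, so the
// intermediate file list and contents never hit the context window.
async function run(tools) {
  const files = await tools.searchFiles({ query: "TODO" });
  const contents = await Promise.all(files.map((f) => tools.readFile(f.path)));
  return contents.filter((c) => c.includes("FIXME")).length;
}

// Stub tools so the sketch runs on its own (purely illustrative).
const tools = {
  searchFiles: async () => [{ path: "a.js" }, { path: "b.js" }],
  readFile: async (path) => (path === "a.js" ? "// FIXME later" : "// done"),
};
```

Only the final count goes back to the model, which is where the context-length savings come from.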
When I implemented this, I noticed major improvements in:
- tool call performance
- context length usage
- overall LLM agentic capabilities
However, this seemingly only applied to Claude. Most models really don't like this way of tool calling, even though it allows them much more freedom. They haven't been trained on it, and as such aren't very good at it.
Gemini, for example, never worked: it always output broken tool calls (wrapping the code in an IIFE, not wrapping it properly, ...). GPT-5.x most of the time refuses to even output an execute_js block (which is what triggers the tool call logic in the application).
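For context, the trigger logic is roughly this (the exact fence format and function name here are illustrative, not my app's literal code): the app scans the model's reply for a fenced execute_js block and runs whatever is inside it. If the model never emits the fence, or mangles it, nothing runs.

```javascript
// Minimal sketch: pull the body out of the first execute_js fence in a
// model reply, or return null if the model didn't emit one.
function extractExecuteJs(reply) {
  const match = reply.match(/```execute_js\n([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}
```

This is why a model that "almost" follows the format still fails completely: a missing or malformed fence means no tool call at all.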
I then tried some open source models like Step Flash 3.5 and GLM, which didn't fare much better. MiniMax 2.5 was probably the best of those.
All models mentioned above were tested through OpenRouter.
I then decided I'd like to see how locally run models would perform - specifically, the ones my MacBook M1 Pro could reasonably run. Qwen3.5 9B seemed like the perfect fit and was the first one I tried. It also turned out to be the last one, as it works so well for me.
Qwen3.5 9B calls the tools perfectly. It doesn't make mistakes often, and when it does, it's smart enough to self-correct in the next tool call. It's the only model I've tried outside of Claude Sonnet 4.6 that handles this style of tool calling this effortlessly.
Just wanted to make this post to share my amazement, never have I experienced such a small model being so capable. Even better - I can run it completely locally and it's not horribly slow!


