I've been working on my own chat application for a while now to experiment with LLMs and get some experience with SSE. Also, it's fun to see if I can mirror the functionality offered in "the big boy tools" like Claude Code, Copilot, ...
A while ago, Cloudflare released a blog post about Code Mode: a new and supposedly better way of letting LLMs call tools. Instead of emitting one JSON tool call at a time, the model writes a small piece of code that calls the tools directly. (They specifically use it for MCPs; my app provides these tools as built-ins, but it's basically the same thing at the end of the day.)
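To make the difference concrete, here's a rough sketch. The tool names and the `tools` object are hypothetical (this is not Cloudflare's actual API, just the general shape): with classic tool calling the model emits one JSON call per step and every intermediate result round-trips through the context window, while with code mode the model writes a script that chains the tools itself.

```javascript
// Classic tool calling: the model emits something like
//   {"name": "search_files", "arguments": {"query": "TODO"}}
// then waits for the result before the next call.
//
// Code mode: the model writes a script like this instead, so the
// intermediate file list and contents never hit the context window.
async function run(tools) {
  const files = await tools.searchFiles({ query: "TODO" });
  const contents = await Promise.all(files.map((f) => tools.readFile(f.path)));
  return contents.filter((c) => c.includes("FIXME")).length;
}

// Stub tools so the sketch runs on its own (purely illustrative).
const tools = {
  searchFiles: async () => [{ path: "a.js" }, { path: "b.js" }],
  readFile: async (path) => (path === "a.js" ? "// FIXME later" : "// done"),
};
```

Only the final count goes back to the model, which is where the context-length savings come from.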
When I implemented this, I noticed major improvements in:
- tool call performance
- context length usage
- overall LLM agentic capabilities
However, this seemingly only applied to Claude. Most models really don't like this way of tool calling, even though it allows them much more freedom. They haven't been trained on it, and as such aren't very good at it.
Gemini, for example, never worked: it always output broken tool calls (wrapping the code in an IIFE, not wrapping it properly, ...). GPT-5.x most of the time refuses to even output an execute_js block (which is what triggers the tool call logic in the application).
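For context, the trigger logic is roughly this (the exact fence format and function name here are illustrative, not my app's literal code): the app scans the model's reply for a fenced execute_js block and runs whatever is inside it. If the model never emits the fence, or mangles it, nothing runs.

```javascript
// Minimal sketch: pull the body out of the first execute_js fence in a
// model reply, or return null if the model didn't emit one.
function extractExecuteJs(reply) {
  const match = reply.match(/```execute_js\n([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}
```

This is why a model that "almost" follows the format still fails completely: a missing or malformed fence means no tool call at all.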
I then tried some open source models like Step Flash 3.5 and GLM, which didn't fare much better. MiniMax 2.5 was probably the best of those.
All models mentioned above were tested through OpenRouter.
I then decided I'd like to see how locally run models would perform - specifically, the ones my MacBook M1 Pro could reasonably run. Qwen3.5 9B seemed like the perfect fit and was the first one I tried. It also turned out to be the last one, as it works so well for me.
Qwen3.5 9B calls the tools perfectly. It doesn't make mistakes often, and when it does, it's smart enough to self-correct in the next tool call. It's the only model I've tried outside of Claude Sonnet 4.6 that handles this style of tool calling this effortlessly.
Just wanted to make this post to share my amazement, never have I experienced such a small model being so capable. Even better - I can run it completely locally and it's not horribly slow!


