My pre-gemma 4 setup was as follows:
llama-swap, Open WebUI, and Claude Code Router on two RTX 3090s + one P40 (my third 3090 died, RIP) and 128 GB of system memory
Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where needed:
Qwen 3.5 30b A3B Q8XL - For general chat, basic document tasks, web search, and any huge-context work that didn't require reasoning. The router was also hardcoded to use this model when my latest query contained "quick"
Qwen 3.5 27b Q8XL - Used as a "higher precision" stand-in for A3B, especially when reasoning was needed. All simple math and summarization tasks went to this model. The router was also hardcoded to use it when my latest query contained "think"
Qwen 3 Next Coder 80B A3B Q6_K - For code generation (seemed to have better outputs, but 122b was better at debugging existing code)
Qwen 3.5 122b UD Q4KXL (no reasoning) - Anything that requires more real world knowledge out of the box
Qwen 3.5 122b Q6 (reasoning) - Reserved for the most complex queries that require reasoning skills and more general knowledge than Qwen 3.5 27b. The router was also hardcoded to use this model when my latest query contained "ultrathink"
This system was really solid, but the weak point was the semantic routing layer. Qwen 3.5 4B would sometimes just straight up pick the wrong model for the job, and it was getting annoying. Even simple greetings like "Hello" and "Who are you?" would get routed to the heavyweight models, usually the 122b non-reasoning one. It would also sometimes completely ignore my "ultrathink" or "quick" override keywords, no matter how I prompted the semantic router (each model had several paragraphs describing which use cases to assign to it, highlighting its strengths and weaknesses, etc.). I ended up having to hardcode the keywords in the router script.
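The keyword hardcoding boils down to checking the latest query for the override words before ever asking the router model. Here's a rough sketch of that logic; the model ids and the `classify()` callback are placeholders, not my actual script:

```python
# Hardcoded keyword overrides, checked BEFORE the semantic router runs.
# Model ids below are illustrative placeholders for my llama-swap entries.
KEYWORD_OVERRIDES = {
    "ultrathink": "qwen3.5-122b-reasoning",  # must come before "think",
    "think": "qwen3.5-27b",                  # since "ultrathink" contains it
    "quick": "qwen3.5-30b-a3b",
}

def pick_model(query: str, classify) -> str:
    """Return a model id, honoring hardcoded keywords before the router.

    `classify` is whatever calls the small routing model (Qwen 3.5 4B in
    the old setup) and returns its chosen model id.
    """
    lowered = query.lower()
    # Dicts preserve insertion order, so "ultrathink" is tested first.
    for keyword, model in KEYWORD_OVERRIDES.items():
        if keyword in lowered:
            return model
    # No override matched: fall back to the semantic router.
    return classify(query)
```

The only subtle part is ordering: since "ultrathink" contains "think", the more specific keyword has to be checked first or it would never win.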
The second weak point was that the 27b model sometimes burned a huge number of thinking tokens; even on simpler math problems (basic PEMDAS) it would overthink, even with optimal sampling parameters. The 122b model was much better about thinking time but had slower generation output. In Claude Code Router, the 122b models would also sometimes fail tool calls that the lighter Qwen models handled fine (maybe Unsloth quantization issues?).
Anyway, this setup completely replaced ChatGPT for me, and surprisingly most Claude Code use cases too. I dealt with the semantic router issues by just manually forcing models with the keywords when the router didn't get it right.
But when Gemma 4 came out, soooo many issues were solved.
First and foremost, I replaced the Qwen 3.5 4B semantic router with Gemma 4 E4B. This instantly fixed my semantic routing issue, and I've had zero complaints since. So far it has perfectly routed every request to the model I would have chosen and prompted it for (something Qwen 3.5 4B commonly failed at). I even disabled thinking and it still works like a charm, and it's lightning fast at picking a model. For this task specifically, the quality matches Qwen 3.5 9B with reasoning on, and I couldn't afford to spend that much memory and time just on routing.
Secondly, I replaced both Qwen 3.5 30B A3B and Qwen 3.5 27B with Gemma 4 26b. For the tasks that would normally be routed to either of those models, it absolutely exceeds my expectations. Basic tasks, image tasks, mathematics, and very light scripting are all significantly better. It sometimes even beats the Qwen 3 Next Coder and 122b models on very specific coding tasks, like frontend HTML design and modifications. Large-context performance has also been rocking.
The best part about Gemma 4 26b is that it's super efficient with its thinking tokens. I have yet to hit an infinite or super lengthy / repetitive generation. It seems very confident in its answers and rarely starts over, outside of a couple of double-checks. Sometimes on super simple tasks it doesn't even think at all!
So now my setup is the following:
Gemma 4 E4B for semantic routing
Gemma 4 26b (reasoning off) - For general chat, extremely basic tasks, simple followup questions with existing data/outputs, etc.
Gemma 4 26b (reasoning on) - Anything that remotely requires reasoning, plus simple math and summarization tasks. The router is also hardcoded to use this model when my latest query contains "think". Also the primary model for extremely simple HTML/JavaScript UI work and/or Python scripts
Qwen 3 Next Coder 80B A3B Q6_K - For all other code generation
Qwen 3.5 122b UD Q4KXL (no reasoning) - Anything that requires more real world knowledge out of the box
Qwen 3.5 122b Q6 (reasoning) - Reserved for the most complex queries that require reasoning skills and more general knowledge than Gemma 4. The router is also hardcoded to use this model when my latest query contains "ultrathink"
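For anyone curious what semantic routing like this looks like in practice: the router model just gets a menu of model ids with short descriptions and is asked to reply with exactly one id, through the OpenAI-compatible endpoint llama-swap exposes. Here's a hypothetical sketch; the model ids, descriptions, port, and prompt are all illustrative, not my actual config:

```python
import json
import urllib.request

# Illustrative menu of model ids -> routing descriptions (placeholders,
# not my real llama-swap model names or my full multi-paragraph prompts).
MODELS = {
    "gemma4-26b": "general chat, basic tasks, image tasks, light scripting",
    "gemma4-26b-think": "anything needing reasoning, simple math, summaries",
    "qwen3-next-coder-80b": "all other code generation",
    "qwen3.5-122b": "queries needing broad real-world knowledge",
    "qwen3.5-122b-think": "the most complex reasoning-heavy queries",
}

def build_routing_request(query: str) -> dict:
    """Build a chat payload asking the router model to name one model id."""
    menu = "\n".join(f"- {mid}: {desc}" for mid, desc in MODELS.items())
    return {
        "model": "gemma4-e4b",  # the routing model, thinking disabled
        "messages": [
            {"role": "system",
             "content": "Pick exactly one model id for the user's request. "
                        "Reply with the id only.\n" + menu},
            {"role": "user", "content": query},
        ],
        "temperature": 0.0,  # deterministic routing
    }

def route(query: str,
          endpoint: str = "http://localhost:8080/v1/chat/completions") -> str:
    """POST the routing request and return the chosen model id."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(build_routing_request(query)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

The chosen id then just becomes the `model` field of the real request, which llama-swap uses to load the right backend.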
I'm super happy with the results. Historically Gemma models never really impressed me but this one really did well in my book!