Gemma4 was released by Google on April 2nd earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post is my notes from benchmarking the two model families. I ran two types of tests:
- Standard llama-bench benchmarks for raw prefill and generation speed
- Single-shot agentic coding tasks using Open Code to see how these models actually perform on real multi-step coding workflows
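For the raw-speed side, this is roughly the kind of invocation I mean. A minimal sketch, assuming a local GGUF file and llama.cpp's `llama-bench` tool; the model path is a placeholder, and the flag values (prompt length, generation length, GPU layers) are just example settings, not the exact ones used for the numbers below:

```shell
# Hypothetical example: measure prefill (-p) and generation (-n) speed
# for a local GGUF model, offloading all layers to the GPU (-ngl 99).
llama-bench -m ./models/your-model.gguf -p 512 -n 128 -ngl 99
```

`llama-bench` prints a table with prompt-processing and token-generation throughput (tok/s), which is where the generation numbers in the comparison table come from.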
My pick is Qwen3.5-27B, which is still the best model for local agentic coding on a 24GB card (RTX 3090/4090). It is reliable, efficient, produces the cleanest code, and fits comfortably on a 4090.
| Model | Gen tok/s | Correct on turn | Code quality | VRAM | Max context |
|---|---|---|---|---|---|
| Gemma4-26B-A4B | ~135 | 3rd | Weakest | ~21 GB | 256K |
| Qwen3.5-35B-A3B | ~136 | 2nd | Best structure, wrong API | ~23 GB | 200K |
| Qwen3.5-27B | ~45 | 1st | Cleanest and best overall | ~21 GB | 130K |
| Gemma4-31B | ~38 | 1st | Clean but shallow | ~24 GB | 65K |
Max Context is the largest context size that fits in VRAM with acceptable generation speed.
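The context limit is mostly a KV-cache budget problem: cache size grows linearly with context length, so on a 24GB card the dense models run out of headroom first. A minimal sketch of the standard fp16 KV-cache estimate; the layer/head/dimension numbers in the example are made-up illustrative values, not the real architecture of any of these models:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache size in GiB for a transformer.

    Per token, each layer stores one K and one V vector of size
    n_kv_heads * head_dim, at bytes_per_elem bytes (2 for fp16).
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
    return total_bytes / 2**30


# Hypothetical dense model: 48 layers, 8 KV heads, head_dim 128.
# At 128K context in fp16 the cache alone would eat ~24 GiB,
# before weights -- which is why context has to be capped
# (or the KV cache quantized) on a 24GB card.
print(f"{kv_cache_gib(48, 8, 128, 131072):.1f} GiB")  # → 24.0 GiB
```

MoE models don't escape this either: the KV cache depends on layer count and KV head count, not on how many experts are active, so "A3B/A4B" helps generation speed but not context headroom.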
- MoE models are ~3x faster at generation (~135 tok/s vs ~45 tok/s), but both dense models got the complex task right on the first try, while both MoE models needed retries.
- Qwen3.5-35B-A3B seems to be the most verbose (32K tokens on the complex task).
- Gemma4-31B dense is the most context-limited of the four on a 4090: I had to drop to 65K context to maintain acceptable generation speed.
- None of the models actually followed TDD despite being asked to. All claimed red-green methodology but wrote integration tests hitting the real API.
- Qwen3.5-27B produced the cleanest code (correct API model name, type hints, docstrings, pathlib). Qwen3.5-35B-A3B had the best structure but hardcoded an API key in tests and used the wrong model name.
You can find the detailed analysis notes here: https://aayushgarg.dev/posts/2026-04-05-qwen35-vs-gemma4/index.html
Happy to discuss and hear about other folks' experiences too.