Comparing Qwen3.5 vs Gemma4 for Local Agentic Coding

Reddit r/LocalLLaMA / 4/5/2026


Key Points

  • The post benchmarks Google’s Gemma4 (released April 2) against Qwen3.5 for local agentic coding, using both llama-bench speed tests and single-shot multi-step coding tasks via Open Code.
  • Across model families, the MoE variants generate substantially faster (~135 tok/s vs ~45 tok/s), but both MoE models needed retries to solve the complex task, which the dense models got right on the first try.
  • For practical “local agentic coding” on a 24GB GPU class (e.g., RTX 3090/4090), the author’s top recommendation is Qwen3.5-27B due to reliability, efficient performance, and the cleanest overall code quality.
  • The benchmarks highlight trade-offs between throughput and usable context length: Gemma4-31B required reducing context to ~65K to keep generation speed acceptable, while Qwen3.5 variants supported larger contexts.
  • Despite being prompted for test-driven development, none of the models followed the requested red-green/TDD pattern; Qwen3.5-27B had the best adherence to correct API usage and code hygiene (type hints/docstrings/pathlib).

Gemma4 was released by Google on April 2nd earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post contains my notes from benchmarking the two model families. I ran two types of tests:

  • Standard llama-bench benchmarks for raw prefill and generation speed
  • Single-shot agentic coding tasks using Open Code to see how these models actually perform on real multi-step coding workflows
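For reference, the speed tests above can be reproduced with a llama-bench invocation along these lines; the model filename and flag values here are assumptions for illustration, not the author's exact setup:

```shell
# Measure prefill (-p) and generation (-n) throughput,
# offloading all layers to the GPU (-ngl) over 5 repetitions (-r)
llama-bench -m qwen3.5-27b-q4_k_m.gguf -p 512 -n 128 -ngl 99 -r 5
```

llama-bench prints tok/s for the prompt-processing and token-generation phases separately, which is where the ~135 vs ~45 tok/s generation numbers below come from.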

My pick is Qwen3.5-27B, which is still the best model for local agentic coding on a 24GB card (RTX 3090/4090). It is reliable, efficient, produces the cleanest code, and fits comfortably on a 4090.

| Model | Gen tok/s | Turn (correct) | Code Quality | VRAM | Max Context |
|---|---|---|---|---|---|
| Gemma4-26B-A4B | ~135 | 3rd | Weakest | ~21 GB | 256K |
| Qwen3.5-35B-A3B | ~136 | 2nd | Best structure, wrong API | ~23 GB | 200K |
| Qwen3.5-27B | ~45 | 1st | Cleanest and best overall | ~21 GB | 130K |
| Gemma4-31B | ~38 | 1st | Clean but shallow | ~24 GB | 65K |

Max Context is the largest context size that fits in VRAM with acceptable generation speed.
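One way to see why max context trades off against VRAM is to estimate the KV-cache footprint, which grows linearly with context length. The architecture numbers below are placeholders for illustration, not the real Gemma4-31B config:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB for a given context length.

    Counts 2 tensors (K and V) of shape [context, n_kv_heads, head_dim]
    per layer, at `bytes_per_elem` bytes each (2 for fp16).
    """
    total = 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem
    return total / 1024**3

# Placeholder numbers (not any real model's config): at 65K context
# this hypothetical 48-layer model already spends 12 GiB on cache alone.
print(kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, context=65_536))
```

On a 24GB card, whatever the weights don't use is the ceiling for this cache, which is why a dense model that nearly fills VRAM with weights has so little context headroom left.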

  • MoE models are ~3x faster at generation (~135 tok/s vs ~45 tok/s), but both dense models got the complex task right on the first try, while both MoE models needed retries.
  • Qwen3.5-35B-A3B seems to be the most verbose (32K tokens on the complex task).
  • Gemma4-31B (dense) is context-limited compared to the others on a 4090; I had to drop to a 65K context to maintain acceptable generation speed.
  • None of the models actually followed TDD despite being asked to. All claimed red-green methodology but wrote integration tests hitting the real API.
  • Qwen3.5-27B produced the cleanest code (correct API model name, type hints, docstrings, pathlib). Qwen3.5-35B-A3B had the best structure but hardcoded an API key in tests and used the wrong model name.
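For flavor, the kind of hygiene the last point credits Qwen3.5-27B with (type hints, docstrings, pathlib) looks roughly like this; the function and filenames are invented for illustration, not taken from the models' actual output:

```python
from pathlib import Path


def load_prompt(prompt_dir: Path, name: str) -> str:
    """Read a named prompt template from `prompt_dir`.

    Raises FileNotFoundError if the template does not exist.
    """
    path = prompt_dir / f"{name}.txt"
    return path.read_text(encoding="utf-8")
```

The contrast is with the failure modes noted above: string-concatenated paths, missing annotations, and secrets (like an API key) hardcoded into test files.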

You can find the detailed analysis notes here: https://aayushgarg.dev/posts/2026-04-05-qwen35-vs-gemma4/index.html

Happy to discuss and hear about other folks' experiences too.

submitted by /u/garg-aayush