Gemma4 was released by Google on April 2nd earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post is my notes from benchmarking the two model families. I ran two types of tests:
- Standard llama-bench benchmarks for raw prefill and generation speed
- Single-shot agentic coding tasks using Open Code to see how these models actually perform on real multi-step coding workflows
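For the raw-speed side, this is roughly the kind of invocation I mean. A minimal sketch, assuming a local GGUF file and llama.cpp's `llama-bench` tool; the model path is a placeholder, and the flag values (prompt length, generation length, GPU layers) are just example settings, not the exact ones used for the numbers below:

```shell
# Hypothetical example: measure prefill (-p) and generation (-n) speed
# for a local GGUF model, offloading all layers to the GPU (-ngl 99).
llama-bench -m ./models/your-model.gguf -p 512 -n 128 -ngl 99
```

`llama-bench` prints a table with prompt-processing and token-generation throughput (tok/s), which is where the generation numbers in the comparison table come from.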
My pick is Qwen3.5-27B, which is still the best model for local agentic coding on a 24GB card (RTX 3090/4090). It is reliable, efficient, produces the cleanest code, and fits comfortably on a 4090.
| Model | Gen tok/s | Correct on turn | Code quality | VRAM | Max context |
|---|---|---|---|---|---|
| Gemma4-26B-A4B | ~135 | 3rd | Weakest | ~21 GB | 256K |
| Qwen3.5-35B-A3B | ~136 | 2nd | Best structure, wrong API | ~23 GB | 200K |
| Qwen3.5-27B | ~45 | 1st | Cleanest and best overall | ~21 GB | 130K |
| Gemma4-31B | ~38 | 1st | Clean but shallow | ~24 GB | 65K |
Max Context is the largest context size that fits in VRAM with acceptable generation speed.
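The context limit is mostly a KV-cache budget problem: cache size grows linearly with context length, so on a 24GB card the dense models run out of headroom first. A minimal sketch of the standard fp16 KV-cache estimate; the layer/head/dimension numbers in the example are made-up illustrative values, not the real architecture of any of these models:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache size in GiB for a transformer.

    Per token, each layer stores one K and one V vector of size
    n_kv_heads * head_dim, at bytes_per_elem bytes (2 for fp16).
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
    return total_bytes / 2**30


# Hypothetical dense model: 48 layers, 8 KV heads, head_dim 128.
# At 128K context in fp16 the cache alone would eat ~24 GiB,
# before weights -- which is why context has to be capped
# (or the KV cache quantized) on a 24GB card.
print(f"{kv_cache_gib(48, 8, 128, 131072):.1f} GiB")  # → 24.0 GiB
```

MoE models don't escape this either: the KV cache depends on layer count and KV head count, not on how many experts are active, so "A3B/A4B" helps generation speed but not context headroom.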
- MoE models are ~3x faster at generation (~135 tok/s vs ~45 tok/s), but both dense models got the complex task right on the first try, while both MoE models needed retries.
- Qwen3.5-35B-A3B seems to be the most verbose (32K tokens on the complex task).
- Gemma4-31B dense is the most context-limited of the four on a 4090: I had to drop to 65K context to maintain acceptable generation speed.
- None of the models actually followed TDD despite being asked to. All claimed red-green methodology but wrote integration tests hitting the real API.
- Qwen3.5-27B produced the cleanest code (correct API model name, type hints, docstrings, pathlib). Qwen3.5-35B-A3B had the best structure but hardcoded an API key in tests and used the wrong model name.
You can find the detailed analysis notes here: https://aayushgarg.dev/posts/2026-04-05-qwen35-vs-gemma4/index.html
Happy to discuss and hear about other folks' experiences too.