I ran a pretty simple but revealing local-LLM test.
At first I was only going to post about the two Qwens and Gemma4 and go to bed, but what do you know, I went on reddit and saw a post that Qwen 3.6-27B had dropped. Oh well...
Models tested:
- Gemma4
cyankiwi/gemma-4-31B-it-AWQ-4bit
- Qwen3.6-35B
RedHatAI/Qwen3.6-35B-A3B-NVFP4
- Qwen3.5-27B
QuantTrio/Qwen3.5-27B-AWQ
- Qwen3.6-27B
cyankiwi/Qwen3.6-27B-AWQ-INT4
Context: I’m working on a fairly complex tool that takes noisy evidence and turns it into a structured “truth report.”
I gave the same Hermes writing agent (“Scribe”) the same task:
take 2 architecture blueprint docs (v1 baseline + v2 expansion) describing the "truth engine" and produce a unified `Masterplan.md` explaining:
- what the product is
- the user problem
- UX/product shape
- UVP/moat
- pipeline
- agent roles
- architecture
- trust/legal/provenance posture
- what changed between plan V1 and V2
- V1: ~16k tokens
- V2: ~4.6k tokens
- Combined: ~20.6k tokens
Then I ran the full workflow locally on my RTX 5090 for all 4 models:
- **Gemma4**
- **Qwen3.6-35B**
- **Qwen3.5-27B**
- **Qwen3.6-27B**
To make it fair and push the models, each model got:
- initial draft
- second-pass revision
- final polish
Each stage was directed and reviewed by my GPT-5.4 agent Manny, so this wasn’t just “ask once and compare vibes.”
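In code terms, the loop each model went through looked roughly like this. This is a minimal sketch with hypothetical function names (`scribe`, `reviewer`); the real run used my own agent tooling, not this exact code:

```python
# Hypothetical sketch of the three-pass workflow: draft -> revise -> polish,
# with a reviewer agent (the GPT-5.4 "Manny" role) directing each stage.

STAGES = [
    "Write the initial draft of Masterplan.md from the two blueprint docs.",
    "Second pass: revise the draft against the reviewer's notes.",
    "Final polish: tighten structure and prose without dropping substance.",
]

def run_workflow(scribe, reviewer, blueprints):
    """Run all three stages; the reviewer critiques before each rewrite."""
    doc = ""
    for stage in STAGES:
        notes = reviewer(stage, doc)                  # director reviews current state
        doc = scribe(stage, blueprints, doc, notes)   # local model writes/rewrites
    return doc
```

Each of the four models was dropped into the `scribe` slot with the same blueprints and the same director, so the only variable was the writing model itself.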
## What I/Manny scored
- **Clarity**
- **Completeness**
- **Discipline**
- **Usefulness**
## Final results
### Clarity
- Gemma4: **9.4**
- Qwen3.6-27B: **8.8**
- Qwen3.6-35B: **8.1**
- Qwen3.5-27B: **7.4**
**Winner: Gemma4** (at a cost, read further below)
Gemma was the best editor. Cleanest structure, best pacing, strongest restraint.
---
### Completeness
- Qwen3.6-35B: **9.6**
- Qwen3.5-27B: **9.1**
- Qwen3.6-27B: **8.7**
- Gemma4: **7.9**
**Winner: Qwen3.6-35B**
The 35B Qwen wrote the most exhaustive architecture doc by far. Best sourcebook, most implementation mass.
---
### Discipline
- Gemma4: **9.5**
- Qwen3.6-27B: **8.6**
- Qwen3.6-35B: **7.7**
- Qwen3.5-27B: **6.8**
**Winner: Gemma4**
Gemma best preserved the actual product identity.
---
### Usefulness
- Qwen3.6-27B: **9.3**
- Qwen3.6-35B: **9.2**
- Gemma4: **8.9**
- Qwen3.5-27B: **8.8**
**Winner: Qwen3.6-27B**
This was the surprise. The 27B Qwen 3.6 ended up as the best **overall practical workhorse** — better balance of depth, readability, and usability than the others.
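For reference, the per-dimension winners above fall out mechanically from the scores. A small sketch using the exact numbers reported:

```python
# Per-dimension winners from the reported scores.
SCORES = {
    "Clarity":      {"Gemma4": 9.4, "Qwen3.6-27B": 8.8, "Qwen3.6-35B": 8.1, "Qwen3.5-27B": 7.4},
    "Completeness": {"Qwen3.6-35B": 9.6, "Qwen3.5-27B": 9.1, "Qwen3.6-27B": 8.7, "Gemma4": 7.9},
    "Discipline":   {"Gemma4": 9.5, "Qwen3.6-27B": 8.6, "Qwen3.6-35B": 7.7, "Qwen3.5-27B": 6.8},
    "Usefulness":   {"Qwen3.6-27B": 9.3, "Qwen3.6-35B": 9.2, "Gemma4": 8.9, "Qwen3.5-27B": 8.8},
}

# Pick the top-scoring model in each dimension.
winners = {dim: max(models, key=models.get) for dim, models in SCORES.items()}
```

Note the final ranking below is a judgment call across dimensions, not a plain average of these numbers.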
## Final ranking
1. **Qwen3.6-27B** — best all-around balance
2. **Gemma4** — best editor / strategist
3. **Qwen3.6-35B** — best exhaustive drafter
4. **Qwen3.5-27B** — solid, but clearly behind the others for this task
1) Best overall balance: Qwen3.6-27B
This is the interesting new winner.
It doesn’t beat Gemma4 on clarity or discipline.
It doesn’t beat Qwen3.6-35B on completeness.
But it wins the thing that matters most for a real working master plan: balance. It’s the best compromise between:
- readability
- completeness
- structure
- practical usefulness
2) Best editor / best strategist: Gemma4
If the goal is:
- cleanest finished document
- strongest executive readability
- best restraint
- best “this feels like a real deliberate plan”
Then Gemma still wins.
3) Best exhaustive architecture quarry: Qwen3.6-35B
If the goal is:
- maximum implementation mass
- biggest architecture sourcebook
- richest mining material for downstream docs
Then Qwen3.6-35B is still the beast.
4) Fourth place: Qwen3.5-27B
Not bad. Not embarrassing.
But now clearly behind both Qwen3.6 variants and Gemma for this kind of long-form architecture/planning task.
## Actual takeaway
This ended up being a really clean split:
- **Gemma4 = best editor**
- **Qwen3.6-35B = best expander**
- **Qwen3.6-27B = best practical default**
- **Qwen3.5-27B = respectable, but not the winner**
So if I were setting a default local writing worker for long-form architecture/master-plan work today, I’d probably choose:
**Qwen3.6-27B**
It’s the best compromise between:
- readability
- completeness
- structure
- practical usefulness
Personal note re Gemma4: its final output was drastically shorter than the Qwens':
- Gemma4 → 147 lines
- Qwen3.6-35B → 725 lines
- Qwen3.5-27B → 840 lines
- Qwen3.6-27B → 555 lines
So while I do agree that less is often more, I found the Gemma4 output lacking in both technical depth and detail. Sure, it captured the core concepts, but I would position the output as more of a pitch deck or high-level concept; the technical details and concepts, however, are sorely missing.
On the other end of the spectrum is Qwen3.6-35B, which delivered 5x the volume. That document could genuinely serve as a technical blueprint and architecture implementation bible. Qwen3.5-27B produced even more, but that was quantity over quality.
I would honestly have rated Gemma4 less favourably than Manny did, so make of that what you will.
For first-draft-only performance, here's my one-shot ranking:
1. Qwen3.6-27B
2. Qwen3.6-35B
3. Qwen3.5-27B
4. Gemma4
Why
1) Qwen3.6-27B
Best balance right out of the gate:
- strong product framing
- solid structure
- good density
- less bloated than the other Qwens
- more complete than Gemma’s first draft
This was the best raw first shot.
2) Qwen3.6-35B
Very strong one-shot draft, but more sprawling:
- most exhaustive
- richest implementation mass
- more likely to over-include
- better sourcebook than polished masterplan on first pass
If you want maximum raw material, this one was a beast.
3) Qwen3.5-27B
Good first-draft generator, but sloppier:
- ambitious
- broad
- lots of content
- weaker discipline and coherence than the 3.6 models
Still useful, but clearly behind both 3.6 variants.
4) Gemma4
Gemma (arguably) won the final polished-document contest, but not the first-draft contest. Its one-shot behaviour was:
- too compressed
- too selective
- not thorough enough for the initial task
It needed the later revision passes to get more substance. Depending on the audience, this may be either good or bad.
Short version
- Best one-shot: Qwen3.6-27B
- Best after revision/polish: Gemma4