I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090

Reddit r/LocalLLaMA / 4/23/2026


Key Points

  • The author ran a local, side-by-side evaluation of four LLMs (Gemma 4, Qwen3.6-27B, Qwen3.6-35B-A3B, and Qwen3.5-27B) on the same architecture-document writing task using an RTX 5090.
  • The test involved generating a unified Masterplan.md from two blueprint documents (a ~16k-token V1 and a ~4.6k-token V2, combined ~20.6k tokens) and assessing outputs across clarity, completeness, discipline, and usefulness.
  • For clarity and discipline, Gemma 4 scored highest, producing the cleanest structure, best pacing, and strongest restraint in preserving the product identity.
  • For completeness, Qwen3.6-35B-A3B substantially led, generating the most exhaustive architecture document and the greatest “implementation mass.”
  • The workflow used a consistent Hermes-based writing agent (“Scribe”) with a second GPT-5.4 agent (“Manny”) to direct and review each stage (initial draft, revision, and final polish) for a more controlled comparison.

I ran a pretty simple but revealing local-LLM test.

At first I was only going to post about the two Qwens and Gemma 4 and go to bed, but what do you know: I go on Reddit and see a post that Qwen3.6-27B dropped. Oh well...

Models tested:

  • Gemma4
    • cyankiwi/gemma-4-31B-it-AWQ-4bit
  • Qwen3.6-35B
    • RedHatAI/Qwen3.6-35B-A3B-NVFP4
  • Qwen3.5-27B
    • QuantTrio/Qwen3.5-27B-AWQ
  • Qwen3.6-27B
    • cyankiwi/Qwen3.6-27B-AWQ-INT4

Context: I’m working on a fairly complex tool that takes noisy evidence and turns it into a structured “truth report.”

I gave the same Hermes writing agent (“Scribe”) the same task:

take 2 architecture blueprint docs (v1 baseline + v2 expansion) describing the "truth engine" and produce a unified `Masterplan.md` explaining:

- what the product is

- the user problem

- UX/product shape

- UVP/moat

- pipeline

- agent roles

- architecture

- trust/legal/provenance posture

- what changed between plan V1 and V2

V1: ~16k tokens

V2: ~4.6k tokens

Combined: ~20.6k tokens

Then I ran the full workflow locally on my RTX 5090 with all 4 models:

- **Gemma4**
- **Qwen3.6-35B**
- **Qwen3.5-27B**
- **Qwen3.6-27B**

To make it fair and push the models, each model got:

  1. initial draft

  2. second-pass revision

  3. final polish

Each stage was directed and reviewed by my GPT-5.4 agent Manny, so this wasn’t just “ask once and compare vibes.”
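The post doesn’t share the actual harness, but the control flow it describes (Scribe drafts, Manny directs and reviews each stage) can be sketched in a few lines. The function name, stage labels as prompt prefixes, and the callable interfaces for the two agents are all assumptions for illustration:

```python
# Minimal sketch of the three-stage workflow described above. "scribe" (the
# local writing model) and "reviewer" (the GPT-5.4 agent "Manny") are
# stand-in callables; the prompt format is an assumption, not the author's.

def run_workflow(scribe, reviewer, task,
                 stages=("initial draft", "second-pass revision", "final polish")):
    doc = ""
    review_log = []
    for stage in stages:
        # Scribe produces or rewrites the document for this stage...
        doc = scribe(stage=stage, task=task, previous=doc)
        # ...and the reviewer agent critiques it before the next pass.
        review_log.append((stage, reviewer(stage, doc)))
    return doc, review_log

# Toy agents, just to show the control flow end to end:
toy_scribe = lambda stage, task, previous: f"{previous}[{stage}: {task}]"
toy_reviewer = lambda stage, doc: f"reviewed {stage}"
final_doc, log = run_workflow(toy_scribe, toy_reviewer, "Masterplan.md")
```

With real agents, `scribe` would be a call to the local model and `reviewer` a call to the frontier model, with the review feedback folded into the next stage’s prompt.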

## What I/Manny scored

- **Clarity**

- **Completeness**

- **Discipline**

- **Usefulness**

## Final results

### Clarity

- Gemma4: **9.4**

- Qwen3.6-27B: **8.8**

- Qwen3.6-35B: **8.1**

- Qwen3.5-27B: **7.4**

**Winner: Gemma4** (at a cost; more on that below)

Gemma was the best editor. Cleanest structure, best pacing, strongest restraint.

---

### Completeness

- Qwen3.6-35B: **9.6**

- Qwen3.5-27B: **9.1**

- Qwen3.6-27B: **8.7**

- Gemma4: **7.9**

**Winner: Qwen3.6-35B**

The 35B Qwen wrote the most exhaustive architecture doc by far. Best sourcebook, most implementation mass.

---

### Discipline

- Gemma4: **9.5**

- Qwen3.6-27B: **8.6**

- Qwen3.6-35B: **7.7**

- Qwen3.5-27B: **6.8**

**Winner: Gemma4**

Gemma best preserved the actual product identity.

---

### Usefulness

- Qwen3.6-27B: **9.3**

- Qwen3.6-35B: **9.2**

- Gemma4: **8.9**

- Qwen3.5-27B: **8.8**

**Winner: Qwen3.6-27B**

This was the surprise. The 27B Qwen 3.6 ended up as the best **overall practical workhorse** — better balance of depth, readability, and usability than the others.

## Final ranking

1. **Qwen3.6-27B** — best all-around balance

2. **Gemma4** — best editor / strategist

3. **Qwen3.6-35B** — best exhaustive drafter

4. **Qwen3.5-27B** — solid, but clearly behind the others for this task
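For what it’s worth, an unweighted mean of the four category scores (copied from the result tables above) can be cross-checked in a few lines. Notably, a plain average would put Gemma4 narrowly first, so the final ranking clearly rewards balance and practical usefulness rather than the raw mean:

```python
# Category scores copied from the result tables above:
# [clarity, completeness, discipline, usefulness]
scores = {
    "Gemma4":      [9.4, 7.9, 9.5, 8.9],
    "Qwen3.6-27B": [8.8, 8.7, 8.6, 9.3],
    "Qwen3.6-35B": [8.1, 9.6, 7.7, 9.2],
    "Qwen3.5-27B": [7.4, 9.1, 6.8, 8.8],
}

# Unweighted mean per model, highest first.
means = {model: sum(s) / len(s) for model, s in scores.items()}
for model, avg in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:12s} {avg:.3f}")
# Gemma4       8.925
# Qwen3.6-27B  8.850
# Qwen3.6-35B  8.650
# Qwen3.5-27B  8.025
```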

1) Best overall balance

**Qwen3.6-27B**

This is the new interesting winner.

It doesn’t beat Gemma4 on clarity or discipline.
It doesn’t beat Qwen3.6-35B on completeness.

But it wins the thing that matters most for a real working master plan: balance. It’s the best compromise between:

  • readability
  • completeness
  • structure
  • practical usefulness

2) Best editor / best strategist

**Gemma4**

If the goal is:

  • cleanest finished document
  • strongest executive readability
  • best restraint
  • best “this feels like a real deliberate plan”

Then Gemma still wins.

3) Best exhaustive architecture quarry

**Qwen3.6-35B**

If the goal is:

  • maximum implementation mass
  • biggest architecture sourcebook
  • richest mining material for downstream docs

Then Qwen3.6-35B is still the beast.

4) Fourth place

**Qwen3.5-27B**

Not bad. Not embarrassing.
But now clearly behind both Qwen3.6 variants and Gemma for this kind of long-form architecture/planning task.

## Actual takeaway

This ended up being a really clean split:

- **Gemma4 = best editor**

- **Qwen3.6-35B = best expander**

- **Qwen3.6-27B = best practical default**

- **Qwen3.5-27B = respectable, but not the winner**

So if I were setting a default local writing worker for long-form architecture/master-plan work today, I’d probably choose:

**Qwen3.6-27B**

It’s the best compromise between:

- readability

- completeness

- structure

- practical usefulness

Personal note re Gemma4: its final output was drastically shorter than the Qwens’:

  • Gemma4: 147 lines
  • Qwen3.6-35B: 725 lines
  • Qwen3.5-27B: 840 lines
  • Qwen3.6-27B: 555 lines

So while I do agree that less is often more, I found the Gemma4 output lacking in both technical depth and detail. Sure, it captured the core concepts, but I would position the output as more of a pitch deck or high-level concept; technical details and concepts are sorely missing.

On the other end of the spectrum is Qwen3.6-35B, which delivered roughly 5x the volume. That document could genuinely serve as a technical blueprint and architecture implementation bible. Qwen3.5-27B produced even more, but that was quantity over quality.

I would honestly have rated Gemma4 less favourably than Manny did, so make of that what you will.
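The "5x the volume" figure checks out against the line counts above:

```python
# Final-output line counts from the list above.
lines = {"Gemma4": 147, "Qwen3.6-35B": 725, "Qwen3.5-27B": 840, "Qwen3.6-27B": 555}

# Qwen3.6-35B's output relative to Gemma4's.
ratio = lines["Qwen3.6-35B"] / lines["Gemma4"]
print(f"{ratio:.1f}x")  # -> 4.9x
```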

For first-draft-only performance, I’d rank them:

One-shot ranking

  1. Qwen3.6-27B
  2. Qwen3.6-35B
  3. Qwen3.5-27B
  4. Gemma4

Why

1) Qwen3.6-27B

Best balance right out of the gate:

  • strong product framing
  • solid structure
  • good density
  • less bloated than the other Qwens
  • more complete than Gemma’s first draft

This was the best raw first shot.

2) Qwen3.6-35B

Very strong one-shot draft, but more sprawling:

  • most exhaustive
  • richest implementation mass
  • more likely to over-include
  • better sourcebook than polished masterplan on first pass

If you want maximum raw material, this one was a beast.

3) Qwen3.5-27B

Good first-draft generator, but sloppier:

  • ambitious
  • broad
  • lots of content
  • weaker discipline and coherence than the 3.6 models

Still useful, but clearly behind both 3.6 variants.

4) Gemma4

Gemma (arguably) won the final polished-document contest, but not the first-draft contest. Its one-shot behaviour was:

  • too compressed
  • too selective
  • not thorough enough for the initial task

It needed the later revision passes to get more substance. Depending on the audience, this may be either good or bad.

Short version

  • Best one-shot: Qwen3.6-27B
  • Best after revision/polish: Gemma4
submitted by /u/Gazorpazorp1