
Qwen 3 32B outscored every Qwen 3.5 model across 11 blind evals; a 3B-active-parameter model won 4

Reddit r/LocalLLaMA / 3/17/2026


Key Points

  • The Qwen 3 32B model (32B dense) ranked first across 11 blind evaluations, outscoring every Qwen 3.5 model with an average score of 9.63 and top-3 placements in 5 of 6 cases.
  • The study used a fixed prompt for all models and blind judgments from pool members, yielding 412 valid judgments out of 704 total.
  • The results show multiple Qwen 3.5 variants (397B-A17B, 122B-A10B, 35B-A3B, and 27B) trailing the 32B dense model, with averages ranging from 9.40 down to 9.11; every MoE configuration tested finished behind the dense 32B.
  • The evaluation covered a broad set of engineering tasks and challenges (including Node.js debugging, SQL optimization, Go concurrency, distributed locks, and more), illustrating the breadth of the benchmarking effort.


People in my SLM results thread asked for Qwen 3.5 numbers. Ran 8 Qwen models head-to-head across 11 hard evaluations: survivorship bias, Arrow's impossibility theorem, Kelly criterion, Simpson's Paradox (construct exact numbers), Bayesian probability, LRU cache with TTL, Node.js 502 debugging, SQL optimization, Go concurrency bugs, distributed lock race conditions, and a baseline string reversal.
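One of those evals asks the model to construct exact Simpson's Paradox numbers. As a reference point, here is a minimal sketch of what a correct construction looks like, using the well-known kidney-stone figures rather than anything from the eval itself:

```python
# Classic kidney-stone example of Simpson's Paradox: treatment A wins
# inside every subgroup yet loses on the pooled totals.
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},  # (successes, trials)
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

for name, g in groups.items():
    rate = {t: s / n for t, (s, n) in g.items()}
    assert rate["A"] > rate["B"]          # A better in each subgroup
    print(f"{name}: A={rate['A']:.0%}  B={rate['B']:.0%}")

# Pool the subgroups: sum successes and trials per treatment.
total = {t: tuple(map(sum, zip(*(g[t] for g in groups.values())))) for t in "AB"}
overall = {t: s / n for t, (s, n) in total.items()}
assert overall["B"] > overall["A"]        # ...yet B better overall
print(f"overall: A={overall['A']:.0%}  B={overall['B']:.0%}")
```

The reversal happens because A was tried disproportionately on the harder subgroup, which is exactly the structure a model has to reproduce to pass the eval.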

Same methodology as the SLM batch. Every model sees the same prompt. Every response is blind-judged by the other models in the pool. 412 valid judgments out of 704 total.
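A minimal sketch of what the blind-judging aggregation plausibly looks like; the record layout, model names, and scores here are hypothetical illustrations, not the actual harness:

```python
from statistics import mean, stdev

# Hypothetical judgment records: (eval, judge, model, score or None if invalid).
judgments = [
    ("simpsons", "qwen3.5-27b", "qwen3-32b", 9.5),
    ("simpsons", "qwen3-8b",    "qwen3-32b", 10.0),
    ("simpsons", "qwen3.5-9b",  "qwen3-32b", None),   # invalid judgment, dropped
]

def summarize(judgments, model):
    # Keep only valid scores for this model, excluding self-judgments.
    scores = [s for _, judge, m, s in judgments
              if m == model and judge != model and s is not None]
    return mean(scores), (stdev(scores) if len(scores) > 1 else 0.0)

avg, sigma = summarize(judgments, "qwen3-32b")
print(f"avg={avg:.2f} σ={sigma:.2f}")  # avg=9.75 σ=0.35
```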

Results:

| Rank | Model | Gen | Active Params | Avg Score | Wins | Top 3 | Avg σ |
|------|-------|-----|---------------|-----------|------|-------|-------|
| 1 | Qwen 3 32B | 3.0 | 32B (dense) | 9.63 | 0 | 5/6 | 0.47 |
| 2 | Qwen 3.5 397B-A17B | 3.5 | 17B (MoE) | 9.40 | 4 | 6/10 | 0.56 |
| 3 | Qwen 3.5 122B-A10B | 3.5 | 10B (MoE) | 9.30 | 2 | 6/9 | 0.47 |
| 4 | Qwen 3.5 35B-A3B | 3.5 | 3B (MoE) | 9.20 | 4 | 6/9 | 0.69 |
| 5 | Qwen 3.5 27B | 3.5 | 27B | 9.11 | 1 | 4/10 | 0.68 |
| 6 | Qwen 3 8B | 3.0 | 8B (dense) | 8.69 | 0 | 4/11 | 0.97 |
| 7 | Qwen 3 Coder Next | 3.0 | n/a | 8.45 | 0 | 2/11 | 0.84 |
| 8 | Qwen 3.5 9B | 3.5 | 9B | 8.19 | 0 | 0/7 | 1.06 |

Three findings I did not expect:

  1. The previous-gen Qwen 3 32B (dense) outscored every Qwen 3.5 MoE model. The 0.23-point gap over the 397B flagship is meaningful when the total spread is 1.44. I expected the flagship to dominate. It did not.
  2. Qwen 3.5 35B-A3B won 4 evals with only 3 billion active parameters. Same number of wins as the 397B flagship. It scored a perfect 10.00 on Simpson's Paradox. For anyone running Qwen locally on consumer hardware, this model punches absurdly above its active weight.
  3. Qwen 3 Coder Next, the coding specialist, ranked 7th overall at 8.45. Below every general-purpose model except the 9B. It lost to general models on Go concurrency (9.09 vs 9.77 for 122B-A10B), distributed locks (9.14 vs 9.74 for 397B-A17B), and SQL optimization (9.38 vs 9.55 for 397B-A17B).

Efficiency data (for the r/LocalLLM crowd who will see this):

| Model | Avg Time (s) | Score/sec | Avg Score |
|-------|--------------|-----------|-----------|
| Qwen 3 Coder Next | 16.9 | 0.87 | 8.45 |
| Qwen 3.5 35B-A3B | 25.3 | 0.54 | 9.20 |
| Qwen 3.5 122B-A10B | 33.1 | 0.52 | 9.30 |
| Qwen 3.5 397B-A17B | 51.0 | 0.36 | 9.40 |
| Qwen 3 32B | 96.7 | 0.31 | 9.63 |
| Qwen 3.5 9B | 39.1 | 0.26 | 8.19 |
| Qwen 3.5 27B | 83.2 | 0.22 | 9.11 |
| Qwen 3 8B | 156.1 | 0.15 | 8.69 |

Sweet spot: 35B-A3B at 0.54 pts/sec. Fastest: Coder Next at 0.87 pts/sec, but 7th in quality. The quality leader (32B) averages 97 seconds per response, which rules it out for anything interactive.
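One caveat on reading the Score/sec column: averaging per-eval rates is not the same as dividing the two column averages, so the numbers need not equal Avg Score divided by Avg Time. A sketch with made-up numbers:

```python
# Hypothetical per-eval records for one model: (score, seconds).
runs = [
    (9.6, 12.0),   # quick, high-scoring eval
    (7.0, 35.0),   # slow, low-scoring eval
]

# Method 1: average the per-eval points-per-second rates.
per_eval = [s / t for s, t in runs]
avg_rate = sum(per_eval) / len(per_eval)

# Method 2: ratio of the column averages (avg score / avg time).
ratio_of_avgs = (sum(s for s, _ in runs) / len(runs)) / (sum(t for _, t in runs) / len(runs))

print(f"mean per-eval rate: {avg_rate:.2f} pts/sec")   # 0.50
print(f"ratio of averages:  {ratio_of_avgs:.2f} pts/sec")  # 0.35
```

The gap is largest when fast evals also score high, which is plausibly why Coder Next's 0.87 pts/sec does not equal 8.45 / 16.9.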

What I do not know and want to be honest about:

Only 58.5% of judgments were valid (412 of 704). The 41.5% failure rate is a data quality problem. I checked whether invalid judgments would flip the order by simulating recovery with the strict-judge average. The top 2 positions held, but ranks 3-5 are within the noise margin.
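A sketch of that recovery simulation as I understand it: impute each invalid judgment with the strict-judge average and re-rank. All scores and counts below are illustrative stand-ins, not the study's data:

```python
from statistics import mean

# Valid scores per model, plus a count of invalid (missing) judgments.
valid = {"model_a": [9.6, 9.7], "model_b": [9.4, 9.3]}
n_invalid = {"model_a": 1, "model_b": 2}
strict_avg = 8.25  # stand-in for the strict-judge pool's average

def recovered(scores, k):
    # Fill the k invalid judgments with the strict-judge average.
    return mean(scores + [strict_avg] * k)

ranks = sorted(valid, key=lambda m: -recovered(valid[m], n_invalid[m]))
print(ranks)  # ['model_a', 'model_b'] — order holds under this imputation
```

A harsher sensitivity check would impute with the minimum observed score instead of the strict average; if the order survives that too, the ranking is fairly robust to the 41.5% loss.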

The judge pool had a clean generational split: every Qwen 3 model judged leniently (avg 9.50+), every Qwen 3.5 model judged strictly (avg 8.25). I do not know if this is a calibration artifact or a genuine difference in how these generations evaluate quality. It adds noise.
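One way to probe whether that split is pure calibration would be to center each judge's scores on its own mean before averaging, so lenient and strict judges contribute on the same scale. A sketch with hypothetical judgments:

```python
from statistics import mean

# Hypothetical (judge, model, score) triples: one lenient and one strict judge.
judgments = [
    ("qwen3-8b",   "A", 9.8), ("qwen3-8b",   "B", 9.4),   # lenient judge
    ("qwen3.5-9b", "A", 8.5), ("qwen3.5-9b", "B", 8.1),   # strict judge
]

# Compute each judge's own mean score across everything it judged.
by_judge = {}
for j, _, s in judgments:
    by_judge.setdefault(j, []).append(s)
judge_mean = {j: mean(v) for j, v in by_judge.items()}

# Center each score on its judge's mean, then average per model.
centered = {}
for j, m, s in judgments:
    centered.setdefault(m, []).append(s - judge_mean[j])
print({m: round(mean(v), 2) for m, v in centered.items()})  # {'A': 0.2, 'B': -0.2}
```

If the rankings survive this normalization, the generational split is just an offset; if they move, the two judge pools disagree about relative quality, which would be the more interesting result.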

Qwen 3 32B appeared in only 6 of 11 evals (API failures on the others). Its higher average may partly reflect a smaller, easier sample. Caveat accordingly.

Questions:

  1. For people running Qwen 3 32B locally: does it consistently outperform 3.5 models in your experience? Or is this an API routing artifact?
  2. Anyone running 35B-A3B on consumer GPUs? With 3B active parameters it should be fast on a 3090/4090. What throughput are you getting?
  3. The dense-vs-MoE result is interesting. On hard multi-step reasoning, dense 32B beat every MoE model. Is this because MoE routing does not select the right experts for novel reasoning chains? Or is the Qwen 3 training data just better?
  4. The coding specialist losing to general models on code: has anyone else seen this pattern with other "coder" branded models?

Full raw data for all 11 evals, every model response, every judgment: github.com/themultivac/multivac-evaluation

Writeup with analysis: open.substack.com/pub/themultivac/p/qwen-3-32b-outscored-every-qwen-35

submitted by /u/Silver_Raspberry_811