People in my SLM results thread asked for Qwen 3.5 numbers. Ran 8 Qwen models head-to-head across 11 hard evaluations: survivorship bias, Arrow's impossibility theorem, Kelly criterion, Simpson's Paradox (construct exact numbers), Bayesian probability, LRU cache with TTL, Node.js 502 debugging, SQL optimization, Go concurrency bugs, distributed lock race conditions, and a baseline string reversal.
Same methodology as the SLM batch. Every model sees the same prompt. Every response is blind-judged by the other models in the pool. 412 valid judgments out of 704 total.
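To make the aggregation concrete, here is a minimal sketch of how cross-model blind judging rolls up into the averages and σ below. The names and numbers are hypothetical; the actual harness is in the linked repo.

```python
from statistics import mean, stdev

# Hypothetical shape: judgments[model][eval] is the list of scores (0-10)
# that model received from the *other* models in the pool for that eval.
# Invalid judgments are already dropped.
judgments = {
    "model-a": {"eval-1": [9.0, 9.5, 8.5], "eval-2": [9.0, 10.0]},
    "model-b": {"eval-1": [8.0, 8.5], "eval-2": [7.5, 8.0, 9.0]},
}

def summarize(judgments):
    """Return (model, avg score, score stdev) rows, best first."""
    rows = []
    for model, per_eval in judgments.items():
        scores = [s for ev in per_eval.values() for s in ev]
        rows.append((model, round(mean(scores), 2), round(stdev(scores), 2)))
    return sorted(rows, key=lambda r: -r[1])

print(summarize(judgments))
```

The σ column in the table is the analogous spread across judges: higher σ means the pool disagreed more about that model.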
Results:
| Rank | Model | Gen | Active Params | Avg Score | Wins | Top 3 | Avg σ (judge spread) |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 32B | 3.0 | 32B (dense) | 9.63 | 0 | 5/6 | 0.47 |
| 2 | Qwen 3.5 397B-A17B | 3.5 | 17B (MoE) | 9.40 | 4 | 6/10 | 0.56 |
| 3 | Qwen 3.5 122B-A10B | 3.5 | 10B (MoE) | 9.30 | 2 | 6/9 | 0.47 |
| 4 | Qwen 3.5 35B-A3B | 3.5 | 3B (MoE) | 9.20 | 4 | 6/9 | 0.69 |
| 5 | Qwen 3.5 27B | 3.5 | 27B | 9.11 | 1 | 4/10 | 0.68 |
| 6 | Qwen 3 8B | 3.0 | 8B (dense) | 8.69 | 0 | 4/11 | 0.97 |
| 7 | Qwen 3 Coder Next | 3.0 | — | 8.45 | 0 | 2/11 | 0.84 |
| 8 | Qwen 3.5 9B | 3.5 | 9B | 8.19 | 0 | 0/7 | 1.06 |
Three findings I did not expect:
- The previous-gen Qwen 3 32B (dense) outscored every Qwen 3.5 MoE model. The 0.23-point gap over the 397B flagship is meaningful when the total spread is 1.44. I expected the flagship to dominate. It did not.
- Qwen 3.5 35B-A3B won 4 evals with only 3 billion active parameters. Same number of wins as the 397B flagship. It scored a perfect 10.00 on Simpson's Paradox. For anyone running Qwen locally on consumer hardware, this model punches absurdly above its active weight.
- Qwen 3 Coder Next, the coding specialist, ranked 7th overall at 8.45. Below every general-purpose model except the 9B. It lost to general models on Go concurrency (9.09 vs 9.77 for 122B-A10B), distributed locks (9.14 vs 9.74 for 397B-A17B), and SQL optimization (9.38 vs 9.55 for 397B-A17B).
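For context on the Simpson's Paradox eval (the one 35B-A3B aced): the task is to construct exact numbers where the trend inside every subgroup reverses in the pooled data. A sketch of one classic valid construction (the well-known kidney-stone figures, used here purely for illustration):

```python
# Simpson's Paradox construction: treatment A beats B within each
# severity group, yet B beats A once the groups are pooled.
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},  # (successes, trials)
    "severe": {"A": (192, 263), "B": (55, 80)},
}

for g, arms in groups.items():
    rate_a = arms["A"][0] / arms["A"][1]
    rate_b = arms["B"][0] / arms["B"][1]
    assert rate_a > rate_b, g  # A wins in every subgroup

tot = {arm: (sum(groups[g][arm][0] for g in groups),
             sum(groups[g][arm][1] for g in groups)) for arm in ("A", "B")}
assert tot["A"][0] / tot["A"][1] < tot["B"][0] / tot["B"][1]  # yet B wins pooled
```

The reversal works because A is disproportionately assigned to the hard (severe) group, dragging its pooled rate down. Getting the arithmetic exactly right is what the eval graded.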
Efficiency data (for the r/LocalLLM crowd who will see this):
| Model | Avg Time (s) | Score/sec | Avg Score |
|---|---|---|---|
| Qwen 3 Coder Next | 16.9 | 0.87 | 8.45 |
| Qwen 3.5 35B-A3B | 25.3 | 0.54 | 9.20 |
| Qwen 3.5 122B-A10B | 33.1 | 0.52 | 9.30 |
| Qwen 3.5 397B-A17B | 51.0 | 0.36 | 9.40 |
| Qwen 3 32B | 96.7 | 0.31 | 9.63 |
| Qwen 3.5 9B | 39.1 | 0.26 | 8.19 |
| Qwen 3.5 27B | 83.2 | 0.22 | 9.11 |
| Qwen 3 8B | 156.1 | 0.15 | 8.69 |
Sweet spot: 35B-A3B at 0.54 pts/sec. Fastest: Coder Next at 0.87 pts/sec, but 7th in quality. The quality leader (32B) averages 97 seconds per response, which rules it out for anything interactive.
What I do not know and want to be honest about:
Only 58.5% of judgments were valid (412 of 704). The 41.5% failure rate is a data quality problem. I checked whether invalid judgments would flip the order by simulating recovery with the strict-judge average. The top 2 positions held, but ranks 3-5 are within the noise margin.
The judge pool had a clean generational split: every Qwen 3 model judged leniently (avg 9.50+), every Qwen 3.5 model judged strictly (avg 8.25). I do not know if this is a calibration artifact or a genuine difference in how these generations evaluate quality. It adds noise.
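One way to probe whether the lenient/strict split is pure calibration: z-score each judge's scores against that judge's own mean and spread before averaging, so a uniformly strict judge stops dragging everyone down. A sketch with made-up numbers (this is not what the harness currently does):

```python
from statistics import mean, stdev

# Hypothetical: scores_by_judge[judge] = list of (target_model, score) pairs.
scores_by_judge = {
    "qwen3-judge":  [("x", 9.6), ("y", 9.4), ("z", 9.5)],  # lenient judge
    "qwen35-judge": [("x", 8.6), ("y", 8.0), ("z", 8.2)],  # strict judge
}

def normalized(scores_by_judge):
    """Average per-judge z-scores for each target model."""
    out = {}
    for judge, pairs in scores_by_judge.items():
        vals = [s for _, s in pairs]
        mu, sd = mean(vals), stdev(vals)
        for model, s in pairs:
            out.setdefault(model, []).append((s - mu) / sd)
    return {m: round(mean(zs), 2) for m, zs in out.items()}

print(normalized(scores_by_judge))
```

If the ranking survives this normalization, the generational split was mostly offset calibration; if it doesn't, the generations genuinely disagree about which answers are good.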
Qwen 3 32B appeared in only 6 of 11 evals (API failures on the others). Its higher average may partly reflect a smaller, easier sample. Caveat accordingly.
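To control for that smaller sample, the fair check is to compare 32B against each rival only on the evals both actually completed. A sketch, assuming per-eval averages are pulled from the repo data (the numbers below are illustrative, not from the run):

```python
def compare_on_shared(scores_a, scores_b):
    """Average-score gap (a minus b) restricted to evals both models ran."""
    shared = set(scores_a) & set(scores_b)
    if not shared:
        return None
    avg = lambda d: sum(d[e] for e in shared) / len(shared)
    return avg(scores_a) - avg(scores_b)

# Illustrative only: model a missed the "sql" eval, model b ran all four.
a = {"kelly": 9.7, "simpson": 9.8, "lru": 9.5}
b = {"kelly": 9.4, "simpson": 9.6, "lru": 9.3, "sql": 9.5}
print(compare_on_shared(a, b))  # positive -> a still ahead on shared evals
```

If 32B's lead holds on the shared-eval subset, the smaller sample isn't what's propping it up.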
Questions:
- For people running Qwen 3 32B locally: does it consistently outperform 3.5 models in your experience? Or is this an API routing artifact?
- Anyone running 35B-A3B on consumer GPUs? With 3B active parameters it should be fast on a 3090/4090. What throughput are you getting?
- The dense-vs-MoE result is interesting. On hard multi-step reasoning, dense 32B beat every MoE model. Is this because MoE routing does not select the right experts for novel reasoning chains? Or is the Qwen 3 training data just better?
- The coding specialist losing to general models on code: has anyone else seen this pattern with other "coder" branded models?
Full raw data for all 11 evals, every model response, every judgment: github.com/themultivac/multivac-evaluation
Writeup with analysis: open.substack.com/pub/themultivac/p/qwen-3-32b-outscored-every-qwen-35