I ran 13 blind peer evaluations today testing 10 small language models on hard frontier-level questions. Not summarization or trivia. Distributed lock debugging, Go concurrency bugs, SQL optimization, Bayesian medical diagnosis, Simpson's Paradox, Arrow's voting theorem, and survivorship bias analysis. The same difficulty level I use for GPT-5.4 and Claude Opus 4.6.
The results surprised me. I ran the numbers twice because the 8B model kept winning.
Aggregate Results Across 13 Evaluations
| Model | Params | 1st Place Wins | Top-3 Finishes | Avg Score | Worst Finish |
|---|---|---|---|---|---|
| Qwen 3 8B | 8B | 6 | 12/13 | 9.40 | 5th |
| Gemma 3 27B | 27B | 3 | 11/13 | 9.33 | 7th |
| Kimi K2.5 | 32B/1T MoE | 3 | 5/13 | 8.78 | 9th |
| Qwen 3 32B | 32B | 2 | 5/13 | 8.40 | 10th (1.00) |
| Phi-4 14B | 14B | 0 | 3/13 | 8.91 | 10th |
| Devstral Small | 24B | 0 | 1/13 | 8.82 | 8th |
| Granite 4.0 Micro | Micro | 0 | 1/13 | 8.61 | 9th |
| Llama 4 Scout | 17B/109B MoE | 0 | 1/13 | 8.57 | 10th |
| Mistral Nemo 12B | 12B | 0 | 0/13 | 8.43 | 10th |
| Llama 3.1 8B | 8B | 0 | 0/13 | 7.51 | 10th |
The headline finding: Qwen 3 8B won more evaluations than any model in the pool, including models with 4x its parameter count.
On code tasks specifically, Qwen 3 8B placed 1st on Go concurrency debugging (9.65), 1st on distributed lock analysis (9.33), and tied 1st on SQL optimization (9.66). On reasoning tasks, it placed 1st on Simpson's Paradox (9.51), 1st on investment decision theory (9.63), and 2nd on Bayesian diagnosis (9.53).
The Qwen 32B collapse. On the distributed lock debugging task (EVAL-20260315-043330), Qwen 3 32B scored 1.00 out of 10. Every other model scored above 5.5. I checked the raw response and the 32B appears to have returned a malformed or truncated output. Same model family, same API provider, same prompt. The 8B scored 9.33 on the identical task. I don't know yet whether this is an OpenRouter routing issue, a quantization artifact on the 32B, or a genuine failure mode. I'm flagging it but not drawing conclusions from one data point.
Kimi K2.5 is the dark horse. It won 3 evaluations including the 502 debugging task (9.57), Arrow's voting theorem (9.18), and survivorship bias (9.63). It's technically a 32B active / 1T MoE model, so calling it an "SLM" is generous. But it ran through OpenRouter like everything else, and its performance on practical debugging tasks was notably strong.
The bottom of the table tells a story too. Llama 3.1 8B finished last or second-to-last in 10 of 13 evaluations. It's an older model and these are hard tasks, but the gap between it and Qwen 3 8B (same parameter count) is massive: average 7.51 vs 9.40. Architecture and training data matter more than parameter count.
Methodology
This is The Multivac, a blind peer evaluation system. All 10 models answer the same question, then each model judges all 10 responses: 10 judges x 10 responses = 100 judgments per evaluation, of which the 10 self-judgments are discarded, leaving 90. Judges don't know which model produced which response. Rankings are computed from the peer consensus, not from a single evaluator.
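For concreteness, the scoring scheme above can be sketched in a few lines. Everything here is illustrative: the `judgments` dict shape and model names are my assumptions for the sketch, not The Multivac's actual schema.

```python
# Illustrative sketch of blind peer scoring with self-judgments excluded.
# Assumed shape: judgments[judge][respondent] = score on a 0-10 scale.

def peer_scores(judgments):
    """Average each respondent's scores over all judges except itself."""
    models = list(judgments)
    return {
        respondent: sum(judgments[j][respondent] for j in models if j != respondent)
        / (len(models) - 1)
        for respondent in models
    }

# Toy 3-model example: each model rates itself 10.0, which gets discarded.
judgments = {
    "model_a": {"model_a": 10.0, "model_b": 9.0, "model_c": 7.0},
    "model_b": {"model_a": 9.5, "model_b": 10.0, "model_c": 6.5},
    "model_c": {"model_a": 9.0, "model_b": 8.5, "model_c": 10.0},
}
scores = peer_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In the toy data, model_a averages (9.5 + 9.0) / 2 = 9.25 from its two peers and tops the ranking, even though every model scored itself a perfect 10.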
Genuine limitations I want to be upfront about:
- AI judging AI has a circularity problem. These scores measure peer consensus, not ground truth. I'm working on a human baseline study to measure the correlation.
- For code tasks, I don't yet run the generated code against test suites. That's coming. For now, the peer scores assess code quality, correctness of reasoning, and edge case handling as judged by other models.
- This is one batch of 13 evaluations on one day. I wouldn't base career decisions on it. But it's real signal.
- Some models (Qwen 32B, Kimi K2.5) returned suspiciously identical scores (8.25) on multiple reasoning evals, which may indicate truncated or templated responses. Investigating.
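On the identical-scores point, one cheap sanity check is to flag any judge whose peer scores within a single eval are (near-)constant. A minimal sketch, again assuming an illustrative judge-to-scores dict rather than the real results.json schema:

```python
# Flag judges whose peer scores are suspiciously uniform within one eval
# (e.g., the repeated 8.25s), which can indicate truncated or templated
# judging. The dict shape here is an assumption for illustration.

def flag_uniform_judges(judgments, tol=0.0):
    """Return judges whose non-self scores span at most `tol`."""
    flagged = []
    for judge, scores in judgments.items():
        peer_vals = [s for target, s in scores.items() if target != judge]
        if peer_vals and max(peer_vals) - min(peer_vals) <= tol:
            flagged.append(judge)
    return flagged
```

A nonzero `tol` (say 0.1) would also catch judges that jitter trivially around a single template value.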
Individual eval results with full rankings, raw judgments, and model responses:
- Go Concurrency: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-033810
- SQL Optimization: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-034158
- 502 Debugging: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-034630
- Distributed Lock: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-043330
- LRU Cache: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-043801
- Bayesian Diagnosis: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-055905
- Simpson's Paradox: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-060532
- Investment Theory: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-061839
- Arrow's Theorem: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-062610
- Survivorship Bias: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-063934
Each folder has results.json (full judgment matrix) and report.md (human-readable report with all model responses). Download, verify, roast the methodology. That's how it improves.
Questions I genuinely want community input on:
- The divergence between Qwen 3 8B and Qwen 3 32B on identical tasks, within the same model family, is striking. Has anyone else seen the 32B underperform the 8B on specific task types? Is this a known quantization issue through OpenRouter?
- For those running these models locally: do the rankings match your experience? Especially Gemma 3 27B placing top-3 in 11/13 evals. That feels right for reasoning but I'd like confirmation on code tasks.
- I'm adding programmatic test suites for code evals next. What frameworks do you use for automated code correctness checking? I'm thinking pytest with sandboxed execution.
- The peer evaluation methodology gets criticism (rightly) for being AI-judging-AI. I'm designing a human baseline study on Prolific. If you have experience running human eval studies, what sample size gave you reliable inter-rater agreement?
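On the test-suite question, the minimal starting point I have in mind, even before wiring up pytest, is to run each generated candidate in a subprocess with a timeout and treat a nonzero exit or a hang as failure. To be clear, a subprocess is process isolation only, not a real sandbox; untrusted model output deserves containers or similar isolation on top. A sketch:

```python
# Run a candidate Python snippet in a child process with a timeout.
# This is NOT a security sandbox, just crash/hang detection.
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout: float = 5.0) -> bool:
    """Return True iff the candidate code exits cleanly within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung or slow code counts as a failure
    finally:
        os.remove(path)
```

From there, swapping the bare script for a generated pytest file and parsing the exit code gets you per-test granularity.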
Full methodology and all historical data: themultivac.com