I ran 13 blind peer evaluations today testing 10 small language models on hard frontier-level questions. Not summarization or trivia. Distributed lock debugging, Go concurrency bugs, SQL optimization, Bayesian medical diagnosis, Simpson's Paradox, Arrow's voting theorem, and survivorship bias analysis. The same difficulty level I use for GPT-5.4 and Claude Opus 4.6.
The results surprised me. I ran the numbers twice because the 8B model kept winning.
Aggregate Results Across 13 Evaluations
| Model | Params | 1st Place Wins | Top-3 Finishes | Avg Score | Worst Finish |
|---|---|---|---|---|---|
| Qwen 3 8B | 8B | 6 | 12/13 | 9.40 | 5th |
| Gemma 3 27B | 27B | 3 | 11/13 | 9.33 | 7th |
| Kimi K2.5 | 32B/1T MoE | 3 | 5/13 | 8.78 | 9th |
| Qwen 3 32B | 32B | 2 | 5/13 | 8.40 | 10th (1.00) |
| Phi-4 14B | 14B | 0 | 3/13 | 8.91 | 10th |
| Devstral Small | 24B | 0 | 1/13 | 8.82 | 8th |
| Granite 4.0 Micro | Micro | 0 | 1/13 | 8.61 | 9th |
| Llama 4 Scout | 17B/109B MoE | 0 | 1/13 | 8.57 | 10th |
| Mistral Nemo 12B | 12B | 0 | 0/13 | 8.43 | 10th |
| Llama 3.1 8B | 8B | 0 | 0/13 | 7.51 | 10th |
The headline finding: Qwen 3 8B won more evaluations than any model in the pool, including models with 4x its parameter count.
On code tasks specifically, Qwen 3 8B placed 1st on Go concurrency debugging (9.65), 1st on distributed lock analysis (9.33), and tied 1st on SQL optimization (9.66). On reasoning tasks, it placed 1st on Simpson's Paradox (9.51), 1st on investment decision theory (9.63), and 2nd on Bayesian diagnosis (9.53).
The Qwen 32B collapse. On the distributed lock debugging task (EVAL-20260315-043330), Qwen 3 32B scored 1.00 out of 10. Every other model scored above 5.5. I checked the raw response and the 32B appears to have returned a malformed or truncated output. Same model family, same API provider, same prompt. The 8B scored 9.33 on the identical task. I don't know yet whether this is an OpenRouter routing issue, a quantization artifact on the 32B, or a genuine failure mode. I'm flagging it but not drawing conclusions from one data point.
Kimi K2.5 is the dark horse. It won 3 evaluations including the 502 debugging task (9.57), Arrow's voting theorem (9.18), and survivorship bias (9.63). It's technically a 32B active / 1T MoE model, so calling it an "SLM" is generous. But it ran through OpenRouter like everything else, and its performance on practical debugging tasks was notably strong.
The bottom of the table tells a story too. Llama 3.1 8B finished last or second-to-last in 10 of 13 evaluations. It's an older model and these are hard tasks, but the gap between it and Qwen 3 8B (same parameter count) is massive: average 7.51 vs 9.40. Architecture and training data matter more than parameter count.
Methodology
This is The Multivac, a blind peer evaluation system. All 10 models answer the same question, then each model judges all 10 responses: 10 judges x 10 responses = 100 judgments per evaluation, of which the 10 self-judgments are discarded, leaving 90. Judges don't know which model produced which response. Rankings are computed from the peer consensus, not from a single evaluator.
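For concreteness, the scoring scheme above can be sketched in a few lines. Everything here is illustrative: the `judgments` dict shape and model names are my assumptions for the sketch, not The Multivac's actual schema.

```python
# Illustrative sketch of blind peer scoring with self-judgments excluded.
# Assumed shape: judgments[judge][respondent] = score on a 0-10 scale.

def peer_scores(judgments):
    """Average each respondent's scores over all judges except itself."""
    models = list(judgments)
    return {
        respondent: sum(judgments[j][respondent] for j in models if j != respondent)
        / (len(models) - 1)
        for respondent in models
    }

# Toy 3-model example: each model rates itself 10.0, which gets discarded.
judgments = {
    "model_a": {"model_a": 10.0, "model_b": 9.0, "model_c": 7.0},
    "model_b": {"model_a": 9.5, "model_b": 10.0, "model_c": 6.5},
    "model_c": {"model_a": 9.0, "model_b": 8.5, "model_c": 10.0},
}
scores = peer_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In the toy data, model_a averages (9.5 + 9.0) / 2 = 9.25 from its two peers and tops the ranking, even though every model scored itself a perfect 10.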
Genuine limitations I want to be upfront about:
- AI judging AI has a circularity problem. These scores measure peer consensus, not ground truth. I'm working on a human baseline study to measure the correlation.
- For code tasks, I don't yet run the generated code against test suites. That's coming. For now, the peer scores assess code quality, correctness of reasoning, and edge case handling as judged by other models.
- This is one batch of 13 evaluations on one day. I wouldn't base career decisions on it. But it's real signal.
- Some models (Qwen 32B, Kimi K2.5) returned suspiciously identical scores (8.25) on multiple reasoning evals, which may indicate truncated or templated responses. Investigating.
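On the identical-scores point, one cheap sanity check is to flag any judge whose peer scores within a single eval are (near-)constant. A minimal sketch, again assuming an illustrative judge-to-scores dict rather than the real results.json schema:

```python
# Flag judges whose peer scores are suspiciously uniform within one eval
# (e.g., the repeated 8.25s), which can indicate truncated or templated
# judging. The dict shape here is an assumption for illustration.

def flag_uniform_judges(judgments, tol=0.0):
    """Return judges whose non-self scores span at most `tol`."""
    flagged = []
    for judge, scores in judgments.items():
        peer_vals = [s for target, s in scores.items() if target != judge]
        if peer_vals and max(peer_vals) - min(peer_vals) <= tol:
            flagged.append(judge)
    return flagged
```

A nonzero `tol` (say 0.1) would also catch judges that jitter trivially around a single template value.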
Individual eval results with full rankings, raw judgments, and model responses:
- Go Concurrency: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-033810
- SQL Optimization: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-034158
- 502 Debugging: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-034630
- Distributed Lock: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-043330
- LRU Cache: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-043801
- Bayesian Diagnosis: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-055905
- Simpson's Paradox: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-060532
- Investment Theory: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-061839
- Arrow's Theorem: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-062610
- Survivorship Bias: https://github.com/themultivac/multivac-evaluation/tree/main/data/evaluations/EVAL-20260315-063934
Each folder has results.json (full judgment matrix) and report.md (human-readable report with all model responses). Download, verify, roast the methodology. That's how it improves.
Questions I genuinely want community input on:
- The divergence between Qwen 3 8B and Qwen 3 32B on identical tasks, within the same model family, is striking. Has anyone else seen the 32B underperform the 8B on specific task types? Is this a known quantization issue through OpenRouter?
- For those running these models locally: do the rankings match your experience? Especially Gemma 3 27B placing top-3 in 11/13 evals. That feels right for reasoning but I'd like confirmation on code tasks.
- I'm adding programmatic test suites for code evals next. What frameworks do you use for automated code correctness checking? I'm thinking pytest with sandboxed execution.
- The peer evaluation methodology gets criticism (rightly) for being AI-judging-AI. I'm designing a human baseline study on Prolific. If you have experience running human eval studies, what sample size gave you reliable inter-rater agreement?
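On the test-suite question, the minimal starting point I have in mind, even before wiring up pytest, is to run each generated candidate in a subprocess with a timeout and treat a nonzero exit or a hang as failure. To be clear, a subprocess is process isolation only, not a real sandbox; untrusted model output deserves containers or similar isolation on top. A sketch:

```python
# Run a candidate Python snippet in a child process with a timeout.
# This is NOT a security sandbox, just crash/hang detection.
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout: float = 5.0) -> bool:
    """Return True iff the candidate code exits cleanly within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung or slow code counts as a failure
    finally:
        os.remove(path)
```

From there, swapping the bare script for a generated pytest file and parsing the exit code gets you per-test granularity.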
Full methodology and all historical data: themultivac.com