There are plenty of "bro trust me, this model is better for coding" discussions out there. I wanted to replace the vibes with actual data: which models write correct code, and how fast they run on real hardware, tested under identical conditions so the results are directly comparable. No cherry-picked prompts, no subjective impressions, just pass@1 on 164 coding problems with an expanded test suite.
Full Results Table
| Model | HumanEval+ | Speed (tok/s) | VRAM |
|---|---|---|---|
| Qwen 3.6 35B-A3B (MoE) | 89.6% | 16.9 | 20.1 GB |
| Qwen 2.5 Coder 32B | 87.2% | 2.5 | 18.6 GB |
| Qwen 2.5 Coder 14B | 86.6% | 5.9 | 8.5 GB |
| Qwen 2.5 Coder 7B | 84.2% | 11.3 | 4.5 GB |
| Phi 4 14B | 82.3% | 5.3 | 8.6 GB |
| Devstral Small 24B | 81.7% | 3.5 | 13.5 GB |
| Gemma 3 27B | 78.7% | 3.0 | 15.6 GB |
| Mistral Small 3.1 24B | 75.6% | 3.6 | 13.5 GB |
| Gemma 3 12B | 75.6% | 5.7 | 7.0 GB |
| Phi 4 Mini 3.8B | 70.7% | 19.6 | 2.5 GB |
| Gemma 3 4B | 64.6% | 16.5 | 2.5 GB |
| Mistral Nemo 12B | 64.6% | 6.9 | 7.1 GB |
| Llama 3.1 8B | 61.0% | 10.8 | 4.7 GB |
| Llama 3.2 3B | 60.4% | 24.1 | 2.0 GB |
| Mistral 7B v0.3 | 37.2% | 11.5 | 4.2 GB |
| Gemma 3 1B | 34.2% | 46.6 | 0.9 GB |
| Llama 3.2 1B | 32.9% | 59.4 | 0.9 GB |
| Gemma 4 31B | 31.1% | 5.5 | 18.6 GB |
| Gemma 4 E4B | 14.6% | 36.7 | 5.2 GB |
| Gemma 4 26B-A4B MoE | 12.2% | 16.2 | 16.1 GB |
| Gemma 4 E2B | 9.2% | 29.2 | 3.4 GB |

Notable findings
Qwen 3.6 35B-A3B is the clear winner at 89.6%, and the MoE architecture means it runs at 16.9 tok/s despite being nominally a 35B model. Active parameter count is what matters for speed; total parameter count is what matters for quality. This model threads that needle well.
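The "active parameters drive speed" point follows from decoding being memory-bandwidth bound: each generated token has to stream every active weight through memory once. A rough back-of-the-envelope sketch (the bandwidth figure below is a hypothetical example, not a measured Mac spec, and this ignores KV-cache reads and compute overhead):

```python
# Upper-bound decode speed for a memory-bandwidth-bound LLM.
# Rule of thumb: tok/s ~= memory_bandwidth / (active_params * bytes_per_weight).

def estimate_tok_per_s(active_params_b: float, bytes_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Ceiling on tokens/sec if decoding is purely bandwidth-limited."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Q4_K_M averages roughly 4.5 bits/weight, i.e. ~0.56 bytes/weight.
# 100 GB/s is an illustrative placeholder bandwidth, not a real spec.
dense_32b = estimate_tok_per_s(32, 0.56, bandwidth_gb_s=100)  # all 32B active
moe_3b = estimate_tok_per_s(3, 0.56, bandwidth_gb_s=100)      # only 3B active

print(f"dense 32B ceiling: ~{dense_32b:.1f} tok/s")
print(f"MoE 3B-active ceiling: ~{moe_3b:.1f} tok/s")
```

Whatever the exact bandwidth number, the ratio is what matters: a 3B-active MoE gets roughly a 10x decode-speed ceiling over a dense 32B model at the same quantization, which lines up with the 16.9 vs 2.5 tok/s gap in the table.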
Best bang-for-RAM: Qwen 2.5 Coder 7B. 84.2% at 11.3 tok/s in 4.5 GB. If you have 8 GB of RAM and want a daily coding assistant, this is probably your model.
The Gemma 4 results are surprising and worth discussing. Gemma 4 31B scores 31.1%, which is lower than Llama 3.2 1B (32.9%) and well below Gemma 3 27B (78.7%). The Gemma 4 MoE variants (26B-A4B) come in at 12.2%. I ran these multiple times to confirm. The Q4_K_M quantization may be hitting the Gemma 4 architecture harder than others, or the HumanEval+ task distribution may not favor its strengths. Open to theories. (https://www.reddit.com/r/LocalLLaMA/s/2pgedDFBYt)
Phi 4 Mini 3.8B is a sleeper pick at 70.7% and 19.6 tok/s in 2.5 GB. If you need something fast and small that still writes reasonable code, it outperforms several much larger models.
Methodology notes
- EvalPlus HumanEval+ was chosen over standard HumanEval because it adds more test cases per problem, reducing the chance of models passing by luck
- Each model evaluated in isolation (no concurrent processes)
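For anyone unfamiliar with the metric: pass@k is the standard unbiased estimator from the Codex paper, and with one greedy sample per problem (n=1, k=1) it collapses to the fraction of problems whose full expanded test suite passes. A minimal sketch (the results dict below is made-up illustration, not EvalPlus's actual output schema):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which passed all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single greedy sample per problem, pass@1 is just the solve rate.
# Hypothetical per-problem outcomes for illustration only:
results = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": True}
score = sum(results.values()) / len(results)
print(f"pass@1 = {score:.1%}")
```

A problem only counts as solved if every test case in the expanded suite passes, which is why HumanEval+ scores run lower than plain HumanEval for the same model.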
GitHub repo (code + raw results): https://github.com/enescingoz/mac-llm-bench
HuggingFace dataset: https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon
What model should I test next? I have a few slots open for the next run and want to prioritize based on what this community is actually using. Also, if you have a Mac and want to contribute your own results on different hardware (M3, M4 Pro, M4 Max, etc.), the framework is fully open source and contributions are welcome.