There are plenty of "bro trust me, this model is better for coding" discussions out there. I wanted to replace the vibes with actual data: which models write correct code, and how fast they run on real hardware, tested under identical conditions so the results are directly comparable. No cherry-picked prompts, no subjective impressions, just pass@1 on 164 coding problems with an expanded test suite.
Full Results Table
| Model | HumanEval+ | Speed (tok/s) | VRAM |
|---|---|---|---|
| Qwen 3.6 35B-A3B (MoE) | 89.6% | 16.9 | 20.1 GB |
| Qwen 2.5 Coder 32B | 87.2% | 2.5 | 18.6 GB |
| Qwen 2.5 Coder 14B | 86.6% | 5.9 | 8.5 GB |
| Qwen 2.5 Coder 7B | 84.2% | 11.3 | 4.5 GB |
| Phi 4 14B | 82.3% | 5.3 | 8.6 GB |
| Devstral Small 24B | 81.7% | 3.5 | 13.5 GB |
| Gemma 3 27B | 78.7% | 3.0 | 15.6 GB |
| Mistral Small 3.1 24B | 75.6% | 3.6 | 13.5 GB |
| Gemma 3 12B | 75.6% | 5.7 | 7.0 GB |
| Phi 4 Mini 3.8B | 70.7% | 19.6 | 2.5 GB |
| Gemma 3 4B | 64.6% | 16.5 | 2.5 GB |
| Mistral Nemo 12B | 64.6% | 6.9 | 7.1 GB |
| Llama 3.1 8B | 61.0% | 10.8 | 4.7 GB |
| Llama 3.2 3B | 60.4% | 24.1 | 2.0 GB |
| Mistral 7B v0.3 | 37.2% | 11.5 | 4.2 GB |
| Gemma 3 1B | 34.2% | 46.6 | 0.9 GB |
| Llama 3.2 1B | 32.9% | 59.4 | 0.9 GB |
| Gemma 4 31B | 31.1% | 5.5 | 18.6 GB |
| Gemma 4 E4B | 14.6% | 36.7 | 5.2 GB |
| Gemma 4 26B-A4B MoE | 12.2% | 16.2 | 16.1 GB |
| Gemma 4 E2B | 9.2% | 29.2 | 3.4 GB |

Notable findings
Qwen 3.6 35B-A3B is the clear winner at 89.6%, and the MoE architecture means it runs at 16.9 tok/s despite being nominally a 35B model. Active parameter count is what matters for speed; total parameter count is what matters for quality. This model threads that needle well.
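The "active parameters drive speed" point follows from decoding being memory-bandwidth bound: each generated token has to stream every active weight through memory once. A rough back-of-the-envelope sketch (the bandwidth figure below is a hypothetical example, not a measured Mac spec, and this ignores KV-cache reads and compute overhead):

```python
# Upper-bound decode speed for a memory-bandwidth-bound LLM.
# Rule of thumb: tok/s ~= memory_bandwidth / (active_params * bytes_per_weight).

def estimate_tok_per_s(active_params_b: float, bytes_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Ceiling on tokens/sec if decoding is purely bandwidth-limited."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Q4_K_M averages roughly 4.5 bits/weight, i.e. ~0.56 bytes/weight.
# 100 GB/s is an illustrative placeholder bandwidth, not a real spec.
dense_32b = estimate_tok_per_s(32, 0.56, bandwidth_gb_s=100)  # all 32B active
moe_3b = estimate_tok_per_s(3, 0.56, bandwidth_gb_s=100)      # only 3B active

print(f"dense 32B ceiling: ~{dense_32b:.1f} tok/s")
print(f"MoE 3B-active ceiling: ~{moe_3b:.1f} tok/s")
```

Whatever the exact bandwidth number, the ratio is what matters: a 3B-active MoE gets roughly a 10x decode-speed ceiling over a dense 32B model at the same quantization, which lines up with the 16.9 vs 2.5 tok/s gap in the table.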
Best bang-for-RAM: Qwen 2.5 Coder 7B. 84.2% at 11.3 tok/s in 4.5 GB. If you have 8 GB of RAM and want a daily coding assistant, this is probably your model.
The Gemma 4 results are surprising and worth discussing. Gemma 4 31B scores 31.1%, which is lower than Llama 3.2 1B (32.9%) and well below Gemma 3 27B (78.7%). The Gemma 4 MoE variants (26B-A4B) come in at 12.2%. I ran these multiple times to confirm. The Q4_K_M quantization may be hitting the Gemma 4 architecture harder than others, or the HumanEval+ task distribution may not favor its strengths. Open to theories. (https://www.reddit.com/r/LocalLLaMA/s/2pgedDFBYt)
Phi 4 Mini 3.8B is a sleeper pick at 70.7% and 19.6 tok/s in 2.5 GB. If you need something fast and small that still writes reasonable code, it outperforms several much larger models.
Methodology notes
- EvalPlus HumanEval+ was chosen over standard HumanEval because it adds more test cases per problem, reducing the chance of models passing by luck
- Each model evaluated in isolation (no concurrent processes)
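For anyone unfamiliar with the metric: pass@k is the standard unbiased estimator from the Codex paper, and with one greedy sample per problem (n=1, k=1) it collapses to the fraction of problems whose full expanded test suite passes. A minimal sketch (the results dict below is made-up illustration, not EvalPlus's actual output schema):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which passed all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single greedy sample per problem, pass@1 is just the solve rate.
# Hypothetical per-problem outcomes for illustration only:
results = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": True}
score = sum(results.values()) / len(results)
print(f"pass@1 = {score:.1%}")
```

A problem only counts as solved if every test case in the expanded suite passes, which is why HumanEval+ scores run lower than plain HumanEval for the same model.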
GitHub repo (code + raw results): https://github.com/enescingoz/mac-llm-bench
HuggingFace dataset: https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon
What model should I test next? I have a few slots open for the next run and want to prioritize based on what this community is actually using. Also, if you have a Mac and want to contribute your own results on different hardware (M3, M4 Pro, M4 Max, etc.), the framework is fully open source and contributions are welcome.