I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Reddit r/LocalLLaMA / 4/21/2026


Key Points

  • A benchmark compared 21 local LLMs on a MacBook Air M5 under identical conditions, measuring both code correctness (pass@1 on 164 EvalPlus HumanEval+ problems) and inference speed (tokens per second).
  • Qwen 3.6 35B-A3B (MoE) was the top performer for code quality at 89.6% while also achieving strong speed (16.9 tok/s), showing that active parameter count, rather than total size, is what drives inference speed.
  • For practical “best value per RAM,” Qwen 2.5 Coder 7B delivered 84.2% accuracy at 11.3 tok/s while fitting in about 4.5 GB VRAM, making it a good daily coding assistant on 8 GB-class setups.
  • The results for Gemma 4 were unexpectedly low, especially for its MoE variant, suggesting that factors like quantization (Q4_K_M) or the HumanEval+ problem distribution may disadvantage Gemma 4.
  • Phi 4 Mini (3.8B) stood out as a “sleeper pick,” reaching 70.7% at 19.6 tok/s in just 2.5 GB, outperforming several larger models in the speed/size tradeoff.

There are plenty of "bro trust me, this model is better for coding" discussions out there. I wanted to replace the vibes with actual data: which model writes correct code and how fast does it run on real hardware, tested under identical conditions so the results are directly comparable. No cherry-picked prompts, no subjective impressions, just pass@1 on 164 coding problems with an expanded test suite.

Full Results Table

Model | HumanEval+ | Speed (tok/s) | VRAM
--- | --- | --- | ---
Qwen 3.6 35B-A3B (MoE) | 89.6% | 16.9 | 20.1 GB
Qwen 2.5 Coder 32B | 87.2% | 2.5 | 18.6 GB
Qwen 2.5 Coder 14B | 86.6% | 5.9 | 8.5 GB
Qwen 2.5 Coder 7B | 84.2% | 11.3 | 4.5 GB
Phi 4 14B | 82.3% | 5.3 | 8.6 GB
Devstral Small 24B | 81.7% | 3.5 | 13.5 GB
Gemma 3 27B | 78.7% | 3.0 | 15.6 GB
Mistral Small 3.1 24B | 75.6% | 3.6 | 13.5 GB
Gemma 3 12B | 75.6% | 5.7 | 7.0 GB
Phi 4 Mini 3.8B | 70.7% | 19.6 | 2.5 GB
Gemma 3 4B | 64.6% | 16.5 | 2.5 GB
Mistral Nemo 12B | 64.6% | 6.9 | 7.1 GB
Llama 3.1 8B | 61.0% | 10.8 | 4.7 GB
Llama 3.2 3B | 60.4% | 24.1 | 2.0 GB
Mistral 7B v0.3 | 37.2% | 11.5 | 4.2 GB
Gemma 3 1B | 34.2% | 46.6 | 0.9 GB
Llama 3.2 1B | 32.9% | 59.4 | 0.9 GB
Gemma 4 31B | 31.1% | 5.5 | 18.6 GB
Gemma 4 E4B | 14.6% | 36.7 | 5.2 GB
Gemma 4 26B-A4B MoE | 12.2% | 16.2 | 16.1 GB
Gemma 4 E2B | 9.2% | 29.2 | 3.4 GB

Notable findings

Qwen 3.6 35B-A3B is the clear winner at 89.6%, and the MoE architecture means it runs at 16.9 tok/s despite being nominally a 35B model. Active parameter count is what matters for speed; total parameter count is what matters for quality. This model threads that needle well.

Best bang-for-RAM: Qwen 2.5 Coder 7B. 84.2% at 11.3 tok/s in 4.5 GB. If you have 8 GB of RAM and want a daily coding assistant, this is probably your model.
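If you want to slice the table by a different tradeoff, here's a minimal sketch (figures copied from the results above, a few representative rows only) that ranks models by raw accuracy per GB of VRAM:

```python
# Rank models by HumanEval+ percentage points per GB of VRAM.
# Tuples are (model, HumanEval+ %, VRAM in GB), taken from the results table.
results = [
    ("Qwen 3.6 35B-A3B (MoE)", 89.6, 20.1),
    ("Qwen 2.5 Coder 7B", 84.2, 4.5),
    ("Phi 4 Mini 3.8B", 70.7, 2.5),
    ("Llama 3.2 3B", 60.4, 2.0),
]

ranked = sorted(results, key=lambda r: r[1] / r[2], reverse=True)
for name, acc, vram in ranked:
    print(f"{name}: {acc / vram:.1f} %-points per GB")
```

Note that this naive metric always favors the tiniest models (Llama 3.2 3B tops it), which is why the "best bang-for-RAM" call above also weights absolute accuracy, not just density.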

The Gemma 4 results are surprising and worth discussing. Gemma 4 31B scores 31.1%, lower than Llama 3.2 1B (32.9%) and well below Gemma 3 27B (78.7%). The Gemma 4 MoE variant (26B-A4B) comes in at just 12.2%. I ran these multiple times to confirm. The Q4_K_M quantization may be hitting the Gemma 4 architecture harder than others, or the HumanEval+ task distribution may not favor its strengths. Open to theories. (https://www.reddit.com/r/LocalLLaMA/s/2pgedDFBYt)

Phi 4 Mini 3.8B is a sleeper pick at 70.7% and 19.6 tok/s in 2.5 GB. If you need something fast and small that still writes reasonable code, it outperforms several much larger models.

Methodology notes

  • EvalPlus HumanEval+ was chosen over standard HumanEval because it adds more test cases per problem, reducing the chance of models passing by luck
  • Each model evaluated in isolation (no concurrent processes)
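For reference, with one sample per problem, pass@1 is simply the fraction of problems whose generated solution passes every test. The general unbiased pass@k estimator (from the original HumanEval evaluation setup) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem:
    n = samples generated, c = samples that passed, k = evaluation budget."""
    if n - c < k:
        return 1.0  # not enough failures to fill a k-sample draw with all fails
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=1 and k=1 (this benchmark's setting), the estimate is just 0 or 1
# per problem, so dataset-level pass@1 is the fraction of solved problems.
print(pass_at_k(1, 1, 1))  # 1.0
print(pass_at_k(5, 2, 1))  # ≈ 0.4
```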

Full writeup: https://medium.com/@enescingoz/i-benchmarked-21-coding-models-on-a-macbook-air-heres-which-ones-actually-write-good-code-1a59441dee14

GitHub repo (code + raw results): https://github.com/enescingoz/mac-llm-bench

HuggingFace dataset: https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon

What model should I test next? I have a few slots open for the next run and want to prioritize based on what this community is actually using. Also, if you have a Mac and want to contribute your own results on different hardware (M3, M4 Pro, M4 Max, etc.), the framework is fully open source and contributions are welcome.

submitted by /u/evoura