This is a follow-up to the post I made last night, where I shared results from some tests on my new laptop. I took in everyone's feedback and re-tooled for another round of benchmarks, applying the advice and suggestions and adjusting the methodology accordingly.
I know going into this that I'm on the wrong side of the Dunning-Kruger curve, but I have the invaluable luxury of standing on the shoulders of everyone here, which lets me avoid spending too much time mired in the 'valley of despair'.
Here's round 2.
Apple M5 Max LLM Benchmark Results (v2)
Follow-up benchmarks addressing community feedback from r/LocalLLaMA.
Changes from v1:
- Added prompt processing (PP) speed — the M5's biggest improvement
- Fair quant comparison — Q4 vs Q4, Q6 vs Q6
- Added Q8_0 quantization test
- Used llama-bench for standardized measurements
- Added MoE model (35B-A3B)
System Specs
| Component | Specification |
|---|---|
| Chip | Apple M5 Max |
| CPU | 18-core (12P + 6E) |
| GPU | 40-core Metal (MTLGPUFamilyApple10, Metal4) |
| Neural Engine | 16-core |
| Memory | 128GB unified |
| Memory Bandwidth | 614 GB/s |
| GPU Memory Allocated | 128,849 MB (full allocation via sysctl) |
| Storage | 4TB NVMe SSD |
| OS | macOS 26.3.1 |
| llama.cpp | v8420 (ggml 0.9.8, build 7f2cbd9a4) |
| MLX | v0.31.1 + mlx-lm v0.31.1 |
| Benchmark tool | llama-bench (3 repetitions per test) |
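For anyone reproducing the "full allocation via sysctl" row: on Apple Silicon macOS, the cap on GPU-accessible wired memory is controlled by the `iogpu.wired_limit_mb` sysctl. It was something along these lines (the 131072 MB value here is illustrative for a 128GB machine, not necessarily the exact number I used, and it resets on reboot):

```shell
# Raise the cap on wired (GPU-accessible) unified memory.
# 131072 MB = 128 GiB; pick a value that leaves headroom for macOS.
sudo sysctl iogpu.wired_limit_mb=131072

# Confirm the new limit
sysctl iogpu.wired_limit_mb
```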
Results: Prompt Processing (PP) — The M5's Real Advantage
This is what people asked for. PP speed is where the M5 Max shines over the M4.
| Model | Size | Quant | PP 512 (tok/s) | PP 2048 (tok/s) | PP 8192 (tok/s) |
|---|---|---|---|---|---|
| Qwen 3.5 35B-A3B MoE | 28.0 GiB | Q6_K | 2,845 | 2,265 | 2,063 |
| DeepSeek-R1 8B | 6.3 GiB | Q6_K | 1,919 | 1,775 | 1,186 |
| Qwen 3.5 122B-A10B MoE | 69.1 GiB | Q4_K_M | 1,011 | 926 | 749 |
| Qwen 3.5 27B | 26.7 GiB | Q8_0 | 557 | 450 | 398 |
| Qwen 3.5 27B | 21.5 GiB | Q6_K | 513 | 410 | 373 |
| Qwen 3.5 27B | 15.9 GiB | Q4_K_M | 439 | 433 | 411 |
| Gemma 3 27B | 20.6 GiB | Q6_K | 409 | 420 | 391 |
| Qwen 2.5 72B | 59.9 GiB | Q6_K | 145 | 140 | — |
Key finding: The 35B-A3B MoE model achieves 2,845 tok/s PP — that's 5.5x faster than the dense 27B at the same quant level. MoE + M5 Max compute is a killer combination for prompt processing.
Results: Token Generation (TG) — Bandwidth-Bound
| Rank | Model | Size | Quant | Engine | TG 128 (tok/s) |
|---|---|---|---|---|---|
| 1 | Qwen 3.5 35B-A3B MoE | 28.0 GiB | Q6_K | llama.cpp | 92.2 |
| 2 | DeepSeek-R1 8B | 6.3 GiB | Q6_K | llama.cpp | 68.2 |
| 3 | Qwen 3.5 122B-A10B MoE | 69.1 GiB | Q4_K_M | llama.cpp | 41.5 |
| 4 | MLX Qwen 3.5 27B | ~16 GiB | 4bit | MLX | 31.6 |
| 5 | Qwen 3.5 27B | 15.9 GiB | Q4_K_M | llama.cpp | 24.3 |
| 6 | Gemma 3 27B | 20.6 GiB | Q6_K | llama.cpp | 20.0 |
| 7 | Qwen 3.5 27B | 21.5 GiB | Q6_K | llama.cpp | 19.0 |
| 8 | Qwen 3.5 27B | 26.7 GiB | Q8_0 | llama.cpp | 17.1 |
| 9 | Qwen 2.5 72B | 59.9 GiB | Q6_K | llama.cpp | 7.9 |
Fair MLX vs llama.cpp Comparison (Corrected)
v1 incorrectly compared MLX 4-bit against llama.cpp Q6_K. Here's the corrected comparison at equivalent quantization:
| Engine | Quant | Model Size | TG tok/s | PP 512 tok/s |
|---|---|---|---|---|
| MLX | 4-bit | ~16 GiB | 31.6 | — |
| llama.cpp | Q4_K_M | 15.9 GiB | 24.3 | 439 |
| llama.cpp | Q6_K | 21.5 GiB | 19.0 | 513 |
| llama.cpp | Q8_0 | 26.7 GiB | 17.1 | 557 |
Corrected finding: MLX is 30% faster than llama.cpp at equivalent 4-bit quantization (31.6 vs 24.3 tok/s). The original v1 claim of "92% faster" was comparing different quant levels (4-bit vs 6-bit) — unfair and misleading. Apologies for that.
Note: MLX 4-bit quantization quality may differ from GGUF Q4_K_M. GGUF K-quants use mixed precision (important layers kept at higher precision), while MLX 4-bit is more uniform. Community consensus suggests GGUF Q4_K_M may produce better quality output than MLX 4-bit at similar file sizes.
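The 30% figure falls straight out of the table rows above; a quick sanity check:

```python
# Sanity-check the corrected MLX vs llama.cpp claim using the
# TG numbers from the table above (tok/s).
mlx_4bit = 31.6     # MLX, 4-bit
llama_q4 = 24.3     # llama.cpp, Q4_K_M

speedup = (mlx_4bit - llama_q4) / llama_q4
print(f"MLX advantage at matched 4-bit quant: {speedup:.0%}")  # ~30%
```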
Quantization Impact on Qwen 3.5 27B
Same model, different quantizations — isolating the effect of quant level:
| Quant | Size | TG tok/s | PP 512 | PP 8192 | Quality |
|---|---|---|---|---|---|
| Q4_K_M | 15.9 GiB | 24.3 | 439 | 411 | Good |
| Q6_K | 21.5 GiB | 19.0 | 513 | 373 | Very good |
| Q8_0 | 26.7 GiB | 17.1 | 557 | 398 | Near-lossless |
Observation: TG speed scales inversely with model size (bandwidth-bound). PP speed is interesting — Q8_0 is fastest for short prompts (more compute headroom) but Q4_K_M holds up better at long prompts (less memory pressure).
MoE Performance: The Standout Result
The Qwen 3.5 35B-A3B MoE model is the surprise performer:
| Metric | 35B-A3B MoE (Q6_K) | 27B Dense (Q6_K) | MoE Advantage |
|---|---|---|---|
| PP 512 | 2,845 tok/s | 513 tok/s | 5.5x |
| PP 8192 | 2,063 tok/s | 373 tok/s | 5.5x |
| TG 128 | 92.2 tok/s | 19.0 tok/s | 4.8x |
| Model size | 28.0 GiB | 21.5 GiB | 1.3x larger |
Despite being 30% larger on disk, the MoE model is nearly 5x faster because only 3B parameters are active per token. On unified memory, there's no PCIe bottleneck for expert selection — all experts are equally accessible. This is where Apple Silicon's unified memory architecture truly shines for MoE models.
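The advantage column is straight division of the two Q6_K rows (the TG ratio comes out to ~4.85x, which the table rounds to 4.8x):

```python
# Recompute the "MoE Advantage" column from the two Q6_K rows above.
moe   = {"PP 512": 2845, "PP 8192": 2063, "TG 128": 92.2}
dense = {"PP 512": 513,  "PP 8192": 373,  "TG 128": 19.0}

for metric, fast in moe.items():
    print(f"{metric}: {fast / dense[metric]:.1f}x advantage")
```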
Memory Bandwidth Efficiency
TG speed correlates with bandwidth / model_size:
| Model | Size (GiB) | Theoretical (tok/s) | Actual (tok/s) | Efficiency |
|---|---|---|---|---|
| DeepSeek-R1 8B Q6_K | 6.3 | 97.5 | 68.2 | 70% |
| Qwen 3.5 27B Q4_K_M | 15.9 | 38.6 | 24.3 | 63% |
| Qwen 3.5 27B Q6_K | 21.5 | 28.6 | 19.0 | 66% |
| Qwen 3.5 27B Q8_0 | 26.7 | 23.0 | 17.1 | 74% |
| Gemma 3 27B Q6_K | 20.6 | 29.8 | 20.0 | 67% |
| Qwen 2.5 72B Q6_K | 59.9 | 10.2 | 7.9 | 77% |
| Qwen 3.5 35B-A3B MoE* | 28.0 (3B active) | ~204 | 92.2 | 45%** |
*MoE effective memory read is much smaller than total model size
**MoE efficiency calculation is different — active parameters drive the bandwidth formula, not total model size
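For anyone checking the table: the "Theoretical" column is just bandwidth divided by model size, and efficiency is measured over theoretical. A minimal sketch (three rows shown):

```python
# Back-of-envelope TG ceiling: generating one token streams the whole
# model through memory once, so max tok/s ~= bandwidth / model size.
# Note: this mixes GB/s with GiB, matching the table's approximation.
BANDWIDTH_GBS = 614  # M5 Max unified memory bandwidth

models = {                      # name: (size in GiB, measured TG tok/s)
    "DeepSeek-R1 8B Q6_K": (6.3, 68.2),
    "Qwen 3.5 27B Q4_K_M": (15.9, 24.3),
    "Qwen 2.5 72B Q6_K": (59.9, 7.9),
}

for name, (size_gib, actual) in models.items():
    ceiling = BANDWIDTH_GBS / size_gib
    print(f"{name}: {ceiling:.1f} tok/s ceiling, "
          f"{actual / ceiling:.0%} efficiency")
```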
Comparison with Other Apple Silicon
Using llama-bench standardized measurements (Qwen 3.5 27B Q6_K, PP 512):
| Chip | GPU Cores | Bandwidth | PP 512 (tok/s) | TG 128 (tok/s) | Source |
|---|---|---|---|---|---|
| M1 Max | 32 | 400 GB/s | ~200 (est.) | ~14 | Community |
| M4 Max | 40 | 546 GB/s | ~350 (est.) | ~19 | Community |
| M5 Max | 40 | 614 GB/s | 513 | 19.0 | This benchmark |
TG improvement M4→M5 is modest (~10%, proportional to bandwidth increase). PP improvement is reportedly much larger (~3x from M4, driven by compute improvements), though we don't have standardized M4 PP numbers to compare directly.
Methodology
- Tool: `llama-bench` (3 repetitions, mean +/- std reported)
- Config: `-ngl 99 -fa 1` (full GPU offload, flash attention on)
- PP tests: 512, 2048, 8192 token prompts
- TG test: 128 token generation
- MLX: Custom Python benchmark (5 prompt types, 300 max tokens)
- Each model loaded fresh (cold start, no prompt caching)
- All GGUF from bartowski (imatrix quantizations) except DeepSeek (unsloth)
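Putting the methodology together, each GGUF row came from an invocation along these lines (model path is a placeholder):

```shell
# PP at 512/2048/8192, TG at 128, 3 repetitions, full GPU offload,
# flash attention on. Swap in the model being tested.
./llama-bench \
  -m models/Qwen3.5-27B-Q6_K.gguf \
  -p 512,2048,8192 \
  -n 128 \
  -ngl 99 -fa 1 \
  -r 3
```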
122B-A10B MoE Results
The community's most requested test. 122B parameters, 10B active per token, Q4_K_M quantization, 69GB on disk.
| Metric | 122B-A10B MoE (Q4_K_M) | 35B-A3B MoE (Q6_K) | 27B Dense (Q6_K) | 72B Dense (Q6_K) |
|---|---|---|---|---|
| PP 512 | 1,011 tok/s | 2,845 tok/s | 513 tok/s | 145 tok/s |
| PP 2048 | 926 tok/s | 2,265 tok/s | 410 tok/s | 140 tok/s |
| PP 8192 | 749 tok/s | 2,063 tok/s | 373 tok/s | — |
| TG 128 | 41.5 tok/s | 92.2 tok/s | 19.0 tok/s | 7.9 tok/s |
| Model size | 69.1 GiB | 28.0 GiB | 21.5 GiB | 59.9 GiB |
| Total params | 122B | 35B | 27B | 72B |
| Active params | 10B | 3B | 27B | 72B |
Key takeaway: A 122B model running at 41.5 tok/s on a laptop. That's faster than the dense 27B (19 tok/s) despite having 4.5x more total parameters. MoE + unified memory is the killer combination for Apple Silicon.
122B vs 72B dense: The 122B MoE is 5.3x faster at token generation (41.5 vs 7.9) and 7x faster at prompt processing (1,011 vs 145) than the 72B dense model, while being only 15% larger on disk (69 vs 60 GiB). And it benchmarks better on most tasks.
What's Next
- BF16 27B test (baseline quality reference)
- Context length scaling tests (8K → 32K → 128K)
- Concurrent request benchmarks
- MLX PP measurement (needs different tooling)
- Comparison with Strix Halo (community requested)
Date
2026-03-21
v1 post: r/LocalLLaMA — thanks for the feedback that made this v2 possible.