
[Round 2 - Followup] M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. (thank you for the feedback)

Reddit r/LocalLLaMA / 3/22/2026


Key Points

  • The author posts Round 2 benchmarks for the Apple M5 Max 128GB, updating the prior test with changes informed by community feedback.
  • Changes from v1 include added prompt processing (PP) speed, fair quant comparison (Q4 vs Q4, Q6 vs Q6), a new Q8_0 quantization test, usage of llama-bench for standardized measurements, and the inclusion of an MoE model (35B-A3B).
  • System specifications are detailed (Chip: Apple M5 Max; CPU: 18-core (12P + 6E); GPU: 40-core; Neural Engine: 16-core; Memory: 128GB; Storage: 4TB; OS: macOS 26.3.1; llama.cpp: v8420; MLX: v0.31.1; Benchmark tool: llama-bench).
  • The results emphasize that prompt processing speed is where the M5 Max shows its real advantage over the M4.

This is a follow-up to the post I made last night, where I shared results from some tests on my new laptop. I took in everyone's feedback and ran another round of benchmarks to address the concerns, applying the advice and suggestions and adjusting the methodology accordingly.

I know going into this that I am on the wrong side of the Dunning-Kruger graph, and I am afforded the invaluable luxury of standing on the shoulders of everyone's work here, which lets me avoid spending too much time mired in the 'valley of despair'.

Here's round 2.

Apple M5 Max LLM Benchmark Results (v2)

Follow-up benchmarks addressing community feedback from r/LocalLLaMA.

Changes from v1:

  • Added prompt processing (PP) speed — the M5's biggest improvement
  • Fair quant comparison — Q4 vs Q4, Q6 vs Q6
  • Added Q8_0 quantization test
  • Used llama-bench for standardized measurements
  • Added MoE model (35B-A3B)

System Specs

| Component | Specification |
| --- | --- |
| Chip | Apple M5 Max |
| CPU | 18-core (12P + 6E) |
| GPU | 40-core Metal (MTLGPUFamilyApple10, Metal4) |
| Neural Engine | 16-core |
| Memory | 128GB unified |
| Memory Bandwidth | 614 GB/s |
| GPU Memory Allocated | 128,849 MB (full allocation via sysctl) |
| Storage | 4TB NVMe SSD |
| OS | macOS 26.3.1 |
| llama.cpp | v8420 (ggml 0.9.8, build 7f2cbd9a4) |
| MLX | v0.31.1 + mlx-lm v0.31.1 |
| Benchmark tool | llama-bench (3 repetitions per test) |

Results: Prompt Processing (PP) — The M5's Real Advantage

This is what people asked for. PP speed is where the M5 Max shines over M4.

| Model | Size | Quant | PP 512 (tok/s) | PP 2048 (tok/s) | PP 8192 (tok/s) |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 35B-A3B MoE | 28.0 GiB | Q6_K | 2,845 | 2,265 | 2,063 |
| DeepSeek-R1 8B | 6.3 GiB | Q6_K | 1,919 | 1,775 | 1,186 |
| Qwen 3.5 122B-A10B MoE | 69.1 GiB | Q4_K_M | 1,011 | 926 | 749 |
| Qwen 3.5 27B | 26.7 GiB | Q8_0 | 557 | 450 | 398 |
| Qwen 3.5 27B | 21.5 GiB | Q6_K | 513 | 410 | 373 |
| Qwen 3.5 27B | 15.9 GiB | Q4_K_M | 439 | 433 | 411 |
| Gemma 3 27B | 20.6 GiB | Q6_K | 409 | 420 | 391 |
| Qwen 2.5 72B | 59.9 GiB | Q6_K | 145 | 140 | not reported |

Key finding: The 35B-A3B MoE model achieves 2,845 tok/s PP — that's 5.5x faster than the dense 27B at the same quant level. MoE + M5 Max compute is a killer combination for prompt processing.

Results: Token Generation (TG) — Bandwidth-Bound

| Rank | Model | Size | Quant | Engine | TG 128 (tok/s) |
| --- | --- | --- | --- | --- | --- |
| 1 | Qwen 3.5 35B-A3B MoE | 28.0 GiB | Q6_K | llama.cpp | 92.2 |
| 2 | DeepSeek-R1 8B | 6.3 GiB | Q6_K | llama.cpp | 68.2 |
| 3 | Qwen 3.5 122B-A10B MoE | 69.1 GiB | Q4_K_M | llama.cpp | 41.5 |
| 4 | Qwen 3.5 27B | ~16 GiB | 4-bit | MLX | 31.6 |
| 5 | Qwen 3.5 27B | 15.9 GiB | Q4_K_M | llama.cpp | 24.3 |
| 6 | Gemma 3 27B | 20.6 GiB | Q6_K | llama.cpp | 20.0 |
| 7 | Qwen 3.5 27B | 21.5 GiB | Q6_K | llama.cpp | 19.0 |
| 8 | Qwen 3.5 27B | 26.7 GiB | Q8_0 | llama.cpp | 17.1 |
| 9 | Qwen 2.5 72B | 59.9 GiB | Q6_K | llama.cpp | 7.9 |

Fair MLX vs llama.cpp Comparison (Corrected)

v1 incorrectly compared MLX 4-bit against llama.cpp Q6_K. Here's the corrected comparison at equivalent quantization:

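| Engine | Quant | Model Size | TG (tok/s) | PP 512 (tok/s) |
| --- | --- | --- | --- | --- |
| MLX | 4-bit | ~16 GiB | 31.6 | not reported |
| llama.cpp | Q4_K_M | 15.9 GiB | 24.3 | 439 |
| llama.cpp | Q6_K | 21.5 GiB | 19.0 | 513 |
| llama.cpp | Q8_0 | 26.7 GiB | 17.1 | 557 |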

Corrected finding: MLX is 30% faster than llama.cpp at equivalent 4-bit quantization (31.6 vs 24.3 tok/s). The original v1 claim of "92% faster" was comparing different quant levels (4-bit vs 6-bit) — unfair and misleading. Apologies for that.

Note: MLX 4-bit quantization quality may differ from GGUF Q4_K_M. GGUF K-quants use mixed precision (important layers kept at higher precision), while MLX 4-bit is more uniform. Community consensus suggests GGUF Q4_K_M may produce better quality output than MLX 4-bit at similar file sizes.

Quantization Impact on Qwen 3.5 27B

Same model, different quantizations — isolating the effect of quant level:

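| Quant | Size | TG (tok/s) | PP 512 (tok/s) | PP 8192 (tok/s) | Quality |
| --- | --- | --- | --- | --- | --- |
| Q4_K_M | 15.9 GiB | 24.3 | 439 | 411 | Good |
| Q6_K | 21.5 GiB | 19.0 | 513 | 373 | Very good |
| Q8_0 | 26.7 GiB | 17.1 | 557 | 398 | Near-lossless |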

Observation: TG speed scales inversely with model size (bandwidth-bound). PP speed is interesting — Q8_0 is fastest for short prompts (more compute headroom) but Q4_K_M holds up better at long prompts (less memory pressure).
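To put a number on "holds up better at long prompts", here is a minimal sketch that computes how much of each quant's PP 512 throughput survives at PP 8192. The figures are copied from the table above; the names `PP_RESULTS` and `pp_retention` are hypothetical helpers for illustration, not part of llama-bench.

```python
# Prompt-processing throughput (tok/s) copied from the Qwen 3.5 27B table.
PP_RESULTS = {
    "Q4_K_M": {"pp512": 439, "pp8192": 411},
    "Q6_K": {"pp512": 513, "pp8192": 373},
    "Q8_0": {"pp512": 557, "pp8192": 398},
}


def pp_retention(pp_short: float, pp_long: float) -> float:
    """Fraction of short-prompt throughput that remains at the long prompt."""
    return pp_long / pp_short


# Retention per quant: a value closer to 1.0 means less slowdown at 8192 tokens.
retention = {
    quant: pp_retention(r["pp512"], r["pp8192"]) for quant, r in PP_RESULTS.items()
}

for quant, frac in retention.items():
    print(f"{quant}: keeps {frac:.1%} of its PP 512 throughput at PP 8192")
```

On these numbers Q4_K_M keeps roughly 94% of its short-prompt speed, while Q6_K and Q8_0 keep a little over 70%. With only three repetitions per test, small differences between Q6_K and Q8_0 should be read cautiously.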

MoE Performance: The Standout Result

The Qwen 3.5 35B-A3B MoE model is the surprise performer:

| Metric | 35B-A3B MoE (Q6_K) | 27B Dense (Q6_K) | MoE Advantage |
| --- | --- | --- | --- |
| PP 512 | 2,845 tok/s | 513 tok/s | 5.5x |
| PP 8192 | 2,063 tok/s | 373 tok/s | 5.5x |
| TG 128 | 92.2 tok/s | 19.0 tok/s | 4.9x |
| Model size | 28.0 GiB | 21.5 GiB | 1.3x larger |

Despite being 30% larger on disk, the MoE model is nearly 5x faster because only 3B parameters are active per token. On unified memory, there's no PCIe bottleneck for expert selection — all experts are equally accessible. This is where Apple Silicon's unified memory architecture truly shines for MoE models.
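The MoE advantage figures are plain ratios of the two columns. A small sketch that recomputes them from the measured values, assuming nothing beyond the numbers in the table (the dictionary names are hypothetical):

```python
# Measured values copied from the MoE vs dense table (both Q6_K).
MOE_35B_A3B = {"pp512": 2845, "pp8192": 2063, "tg128": 92.2, "size_gib": 28.0}
DENSE_27B = {"pp512": 513, "pp8192": 373, "tg128": 19.0, "size_gib": 21.5}

# Ratio of MoE to dense for every metric the two models share.
advantage = {key: MOE_35B_A3B[key] / DENSE_27B[key] for key in MOE_35B_A3B}

for key, value in advantage.items():
    print(f"{key}: {value:.2f}x")
```

The speedup lands around 5x even though the dense model activates 9x more parameters per token (27B vs 3B), so the gain is real but does not scale one-to-one with active parameter count.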

Memory Bandwidth Efficiency

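TG speed correlates with bandwidth / model_size. Theoretical is the 614 GB/s memory bandwidth divided by the model size, and Efficiency is Actual divided by Theoretical: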

| Model | Size (GiB) | Theoretical (tok/s) | Actual (tok/s) | Efficiency |
| --- | --- | --- | --- | --- |
| DeepSeek-R1 8B Q6_K | 6.3 | 97.5 | 68.2 | 70% |
| Qwen 3.5 27B Q4_K_M | 15.9 | 38.6 | 24.3 | 63% |
| Qwen 3.5 27B Q6_K | 21.5 | 28.6 | 19.0 | 66% |
| Qwen 3.5 27B Q8_0 | 26.7 | 23.0 | 17.1 | 74% |
| Gemma 3 27B Q6_K | 20.6 | 29.8 | 20.0 | 67% |
| Qwen 2.5 72B Q6_K | 59.9 | 10.2 | 7.9 | 77% |
| Qwen 3.5 35B-A3B MoE* | 28.0 (3B active) | ~204 | 92.2 | 45%** |

*MoE effective memory read is much smaller than total model size
**MoE efficiency calculation is different — active parameters drive the bandwidth formula, not total model size
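The dense rows of the efficiency table can be reproduced with a few lines. This is a sketch of the bandwidth-bound estimate as the table appears to apply it: it divides GB/s by GiB directly, so the two units are treated as interchangeable, which is a simplification. The MoE row is left out because the post does not state the active byte count behind the ~204 tok/s figure. Function and variable names are hypothetical.

```python
BANDWIDTH_GB_S = 614.0  # memory bandwidth from the System Specs table

# (model, size in GiB, measured TG 128 in tok/s), copied from the table above.
DENSE_RESULTS = [
    ("DeepSeek-R1 8B Q6_K", 6.3, 68.2),
    ("Qwen 3.5 27B Q4_K_M", 15.9, 24.3),
    ("Qwen 3.5 27B Q6_K", 21.5, 19.0),
    ("Qwen 3.5 27B Q8_0", 26.7, 17.1),
    ("Gemma 3 27B Q6_K", 20.6, 20.0),
    ("Qwen 2.5 72B Q6_K", 59.9, 7.9),
]


def theoretical_tg(size_gib: float, bandwidth: float = BANDWIDTH_GB_S) -> float:
    """Upper bound on tok/s if every token reads all weights once."""
    return bandwidth / size_gib


def efficiency(actual_tg: float, size_gib: float,
               bandwidth: float = BANDWIDTH_GB_S) -> float:
    """Measured tok/s as a fraction of the bandwidth-bound estimate."""
    return actual_tg / theoretical_tg(size_gib, bandwidth)


rows = [
    (name, theoretical_tg(size), actual, efficiency(actual, size))
    for name, size, actual in DENSE_RESULTS
]

for name, theo, actual, eff in rows:
    print(f"{name:<22} {theo:6.1f} {actual:6.1f} {eff:6.1%}")
```

Because the table rounds the theoretical value before dividing, a recomputed percentage can differ from the printed one by about a point.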

Comparison with Other Apple Silicon

Qwen 3.5 27B Q6_K, PP 512 and TG 128. The M5 Max row is measured with llama-bench; the M1 Max and M4 Max rows are community estimates:

| Chip | GPU Cores | Bandwidth | PP 512 (tok/s) | TG 128 (tok/s) | Source |
| --- | --- | --- | --- | --- | --- |
| M1 Max | 32 | 400 GB/s | ~200 (est.) | ~14 | Community |
| M4 Max | 40 | 546 GB/s | ~350 (est.) | ~19 | Community |
| M5 Max | 40 | 614 GB/s | 513 | 19.0 | This benchmark |

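TG improvement M4→M5 should be modest: bandwidth rises about 12% (546 → 614 GB/s), and the rounded community estimate in the table (~19 tok/s) is too coarse to confirm even that. PP improvement is reportedly much larger (figures around 3x over M4 have been claimed, driven by compute improvements), but the community estimate in the table implies closer to 1.5x (513 vs ~350 tok/s), and we don't have standardized M4 PP numbers to compare directly.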

Methodology

  • Tool: llama-bench (3 repetitions, mean +/- std reported)
  • Config: -ngl 99 -fa 1 (full GPU offload, flash attention on)
  • PP tests: 512, 2048, 8192 token prompts
  • TG test: 128 token generation
  • MLX: Custom Python benchmark (5 prompt types, 300 max tokens)
  • Each model loaded fresh (cold start, no prompt caching)
  • All GGUF from bartowski (imatrix quantizations) except DeepSeek (unsloth)
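For anyone who wants to reproduce the llama.cpp runs, the methodology above maps onto a llama-bench invocation roughly like the one this sketch builds. Only `-ngl 99 -fa 1` is quoted from the post; the `-p`, `-n`, `-r`, and `-o` values are my reading of the bullet list, and the model path is a hypothetical placeholder.

```python
import shlex


def build_llama_bench_cmd(model_path: str) -> list[str]:
    """Assemble a llama-bench command matching the listed methodology."""
    return [
        "llama-bench",
        "-m", model_path,        # GGUF file to benchmark
        "-ngl", "99",            # full GPU offload (from the post)
        "-fa", "1",              # flash attention on (from the post)
        "-p", "512,2048,8192",   # prompt-processing sizes (assumed flag values)
        "-n", "128",             # token-generation length (assumed flag value)
        "-r", "3",               # repetitions per test (assumed flag value)
        "-o", "json",            # machine-readable output (an assumption)
    ]


# Hypothetical path; substitute the GGUF you actually downloaded.
cmd = build_llama_bench_cmd("models/example-Q6_K.gguf")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine with llama-bench on the PATH. Building the command as a list rather than a string avoids shell-quoting problems with model paths that contain spaces.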

122B-A10B MoE Results

The community's most requested test: 122B parameters, 10B active per token, Q4_K_M quantization, 69.1 GiB on disk.

| Metric | 122B-A10B MoE (Q4_K_M) | 35B-A3B MoE (Q6_K) | 27B Dense (Q6_K) | 72B Dense (Q6_K) |
| --- | --- | --- | --- | --- |
| PP 512 | 1,011 tok/s | 2,845 tok/s | 513 tok/s | 145 tok/s |
| PP 2048 | 926 tok/s | 2,265 tok/s | 410 tok/s | 140 tok/s |
| PP 8192 | 749 tok/s | 2,063 tok/s | 373 tok/s | not reported |
| TG 128 | 41.5 tok/s | 92.2 tok/s | 19.0 tok/s | 7.9 tok/s |
| Model size | 69.1 GiB | 28.0 GiB | 21.5 GiB | 59.9 GiB |
| Total params | 122B | 35B | 27B | 72B |
| Active params | 10B | 3B | 27B | 72B |

Key takeaway: A 122B model running at 41.5 tok/s on a laptop. That's faster than the dense 27B (19 tok/s) despite having 4.5x more total parameters. MoE + unified memory is the killer combination for Apple Silicon.

122B vs 72B dense: The 122B MoE is 5.3x faster at token generation (41.5 vs 7.9) and 7x faster at prompt processing (1,011 vs 145) than the 72B dense model, while being only 15% larger on disk (69 vs 60 GiB). And it benchmarks better on most tasks.

What's Next

  • BF16 27B test (baseline quality reference)
  • Context length scaling tests (8K → 32K → 128K)
  • Concurrent request benchmarks
  • MLX PP measurement (needs different tooling)
  • Comparison with Strix Halo (community requested)

Date

2026-03-21

v1 post: r/LocalLLaMA — thanks for the feedback that made this v2 possible.

submitted by /u/affenhoden