This is a follow-up to the post I made last night, where I shared results from some tests on my new laptop. I took in everyone's feedback and re-tooled for another round of benchmarks, applying the advice and suggestions and adjusting the methodology accordingly.
I know going into this that I'm on the wrong side of the Dunning-Kruger curve, but I have the invaluable luxury of standing on the shoulders of everyone here, which lets me avoid spending too much time mired in the 'valley of despair'.
Here's round 2.
Apple M5 Max LLM Benchmark Results (v2)
Follow-up benchmarks addressing community feedback from r/LocalLLaMA.
Changes from v1:
- Added prompt processing (PP) speed — the M5's biggest improvement
- Fair quant comparison — Q4 vs Q4, Q6 vs Q6
- Added Q8_0 quantization test
- Used llama-bench for standardized measurements
- Added MoE model (35B-A3B)
System Specs
| Component | Specification |
|---|---|
| Chip | Apple M5 Max |
| CPU | 18-core (12P + 6E) |
| GPU | 40-core Metal (MTLGPUFamilyApple10, Metal4) |
| Neural Engine | 16-core |
| Memory | 128GB unified |
| Memory Bandwidth | 614 GB/s |
| GPU Memory Allocated | 128,849 MB (full allocation via sysctl) |
| Storage | 4TB NVMe SSD |
| OS | macOS 26.3.1 |
| llama.cpp | v8420 (ggml 0.9.8, build 7f2cbd9a4) |
| MLX | v0.31.1 + mlx-lm v0.31.1 |
| Benchmark tool | llama-bench (3 repetitions per test) |
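For anyone reproducing the "full allocation via sysctl" row: on Apple Silicon macOS, the cap on GPU-accessible wired memory is controlled by the `iogpu.wired_limit_mb` sysctl. It was something along these lines (the 131072 MB value here is illustrative for a 128GB machine, not necessarily the exact number I used, and it resets on reboot):

```shell
# Raise the cap on wired (GPU-accessible) unified memory.
# 131072 MB = 128 GiB; pick a value that leaves headroom for macOS.
sudo sysctl iogpu.wired_limit_mb=131072

# Confirm the new limit
sysctl iogpu.wired_limit_mb
```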
Results: Prompt Processing (PP) — The M5's Real Advantage
This is what people asked for. PP speed is where the M5 Max shines over the M4.
| Model | Size | Quant | PP 512 (tok/s) | PP 2048 (tok/s) | PP 8192 (tok/s) |
|---|---|---|---|---|---|
| Qwen 3.5 35B-A3B MoE | 28.0 GiB | Q6_K | 2,845 | 2,265 | 2,063 |
| DeepSeek-R1 8B | 6.3 GiB | Q6_K | 1,919 | 1,775 | 1,186 |
| Qwen 3.5 122B-A10B MoE | 69.1 GiB | Q4_K_M | 1,011 | 926 | 749 |
| Qwen 3.5 27B | 26.7 GiB | Q8_0 | 557 | 450 | 398 |
| Qwen 3.5 27B | 21.5 GiB | Q6_K | 513 | 410 | 373 |
| Qwen 3.5 27B | 15.9 GiB | Q4_K_M | 439 | 433 | 411 |
| Gemma 3 27B | 20.6 GiB | Q6_K | 409 | 420 | 391 |
| Qwen 2.5 72B | 59.9 GiB | Q6_K | 145 | 140 | — |
Key finding: The 35B-A3B MoE model achieves 2,845 tok/s PP — that's 5.5x faster than the dense 27B at the same quant level. MoE + M5 Max compute is a killer combination for prompt processing.
Results: Token Generation (TG) — Bandwidth-Bound
| Rank | Model | Size | Quant | Engine | TG 128 (tok/s) |
|---|---|---|---|---|---|
| 1 | Qwen 3.5 35B-A3B MoE | 28.0 GiB | Q6_K | llama.cpp | 92.2 |
| 2 | DeepSeek-R1 8B | 6.3 GiB | Q6_K | llama.cpp | 68.2 |
| 3 | Qwen 3.5 122B-A10B MoE | 69.1 GiB | Q4_K_M | llama.cpp | 41.5 |
| 4 | MLX Qwen 3.5 27B | ~16 GiB | 4bit | MLX | 31.6 |
| 5 | Qwen 3.5 27B | 15.9 GiB | Q4_K_M | llama.cpp | 24.3 |
| 6 | Gemma 3 27B | 20.6 GiB | Q6_K | llama.cpp | 20.0 |
| 7 | Qwen 3.5 27B | 21.5 GiB | Q6_K | llama.cpp | 19.0 |
| 8 | Qwen 3.5 27B | 26.7 GiB | Q8_0 | llama.cpp | 17.1 |
| 9 | Qwen 2.5 72B | 59.9 GiB | Q6_K | llama.cpp | 7.9 |
Fair MLX vs llama.cpp Comparison (Corrected)
v1 incorrectly compared MLX 4-bit against llama.cpp Q6_K. Here's the corrected comparison at equivalent quantization:
| Engine | Quant | Model Size | TG tok/s | PP 512 tok/s |
|---|---|---|---|---|
| MLX | 4-bit | ~16 GiB | 31.6 | — |
| llama.cpp | Q4_K_M | 15.9 GiB | 24.3 | 439 |
| llama.cpp | Q6_K | 21.5 GiB | 19.0 | 513 |
| llama.cpp | Q8_0 | 26.7 GiB | 17.1 | 557 |
Corrected finding: MLX is 30% faster than llama.cpp at equivalent 4-bit quantization (31.6 vs 24.3 tok/s). The original v1 claim of "92% faster" was comparing different quant levels (4-bit vs 6-bit) — unfair and misleading. Apologies for that.
Note: MLX 4-bit quantization quality may differ from GGUF Q4_K_M. GGUF K-quants use mixed precision (important layers kept at higher precision), while MLX 4-bit is more uniform. Community consensus suggests GGUF Q4_K_M may produce better quality output than MLX 4-bit at similar file sizes.
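The 30% figure falls straight out of the table rows above; a quick sanity check:

```python
# Sanity-check the corrected MLX vs llama.cpp claim using the
# TG numbers from the table above (tok/s).
mlx_4bit = 31.6     # MLX, 4-bit
llama_q4 = 24.3     # llama.cpp, Q4_K_M

speedup = (mlx_4bit - llama_q4) / llama_q4
print(f"MLX advantage at matched 4-bit quant: {speedup:.0%}")  # ~30%
```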
Quantization Impact on Qwen 3.5 27B
Same model, different quantizations — isolating the effect of quant level:
| Quant | Size | TG tok/s | PP 512 | PP 8192 | Quality |
|---|---|---|---|---|---|
| Q4_K_M | 15.9 GiB | 24.3 | 439 | 411 | Good |
| Q6_K | 21.5 GiB | 19.0 | 513 | 373 | Very good |
| Q8_0 | 26.7 GiB | 17.1 | 557 | 398 | Near-lossless |
Observation: TG speed scales inversely with model size (bandwidth-bound). PP speed is interesting — Q8_0 is fastest for short prompts (more compute headroom) but Q4_K_M holds up better at long prompts (less memory pressure).
MoE Performance: The Standout Result
The Qwen 3.5 35B-A3B MoE model is the surprise performer:
| Metric | 35B-A3B MoE (Q6_K) | 27B Dense (Q6_K) | MoE Advantage |
|---|---|---|---|
| PP 512 | 2,845 tok/s | 513 tok/s | 5.5x |
| PP 8192 | 2,063 tok/s | 373 tok/s | 5.5x |
| TG 128 | 92.2 tok/s | 19.0 tok/s | 4.8x |
| Model size | 28.0 GiB | 21.5 GiB | 1.3x larger |
Despite being 30% larger on disk, the MoE model is nearly 5x faster because only 3B parameters are active per token. On unified memory, there's no PCIe bottleneck for expert selection — all experts are equally accessible. This is where Apple Silicon's unified memory architecture truly shines for MoE models.
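The advantage column is straight division of the two Q6_K rows (the TG ratio comes out to ~4.85x, which the table rounds to 4.8x):

```python
# Recompute the "MoE Advantage" column from the two Q6_K rows above.
moe   = {"PP 512": 2845, "PP 8192": 2063, "TG 128": 92.2}
dense = {"PP 512": 513,  "PP 8192": 373,  "TG 128": 19.0}

for metric, fast in moe.items():
    print(f"{metric}: {fast / dense[metric]:.1f}x advantage")
```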
Memory Bandwidth Efficiency
TG speed correlates with bandwidth / model_size:
| Model | Size (GiB) | Theoretical (tok/s) | Actual (tok/s) | Efficiency |
|---|---|---|---|---|
| DeepSeek-R1 8B Q6_K | 6.3 | 97.5 | 68.2 | 70% |
| Qwen 3.5 27B Q4_K_M | 15.9 | 38.6 | 24.3 | 63% |
| Qwen 3.5 27B Q6_K | 21.5 | 28.6 | 19.0 | 66% |
| Qwen 3.5 27B Q8_0 | 26.7 | 23.0 | 17.1 | 74% |
| Gemma 3 27B Q6_K | 20.6 | 29.8 | 20.0 | 67% |
| Qwen 2.5 72B Q6_K | 59.9 | 10.2 | 7.9 | 77% |
| Qwen 3.5 35B-A3B MoE* | 28.0 (3B active) | ~204 | 92.2 | 45%** |
*MoE effective memory read is much smaller than total model size
**MoE efficiency calculation is different — active parameters drive the bandwidth formula, not total model size
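For anyone checking the table: the "Theoretical" column is just bandwidth divided by model size, and efficiency is measured over theoretical. A minimal sketch (three rows shown):

```python
# Back-of-envelope TG ceiling: generating one token streams the whole
# model through memory once, so max tok/s ~= bandwidth / model size.
# Note: this mixes GB/s with GiB, matching the table's approximation.
BANDWIDTH_GBS = 614  # M5 Max unified memory bandwidth

models = {                      # name: (size in GiB, measured TG tok/s)
    "DeepSeek-R1 8B Q6_K": (6.3, 68.2),
    "Qwen 3.5 27B Q4_K_M": (15.9, 24.3),
    "Qwen 2.5 72B Q6_K": (59.9, 7.9),
}

for name, (size_gib, actual) in models.items():
    ceiling = BANDWIDTH_GBS / size_gib
    print(f"{name}: {ceiling:.1f} tok/s ceiling, "
          f"{actual / ceiling:.0%} efficiency")
```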
Comparison with Other Apple Silicon
Using llama-bench standardized measurements (Qwen 3.5 27B Q6_K, PP 512):
| Chip | GPU Cores | Bandwidth | PP 512 (tok/s) | TG 128 (tok/s) | Source |
|---|---|---|---|---|---|
| M1 Max | 32 | 400 GB/s | ~200 (est.) | ~14 | Community |
| M4 Max | 40 | 546 GB/s | ~350 (est.) | ~19 | Community |
| M5 Max | 40 | 614 GB/s | 513 | 19.0 | This benchmark |
TG improvement M4→M5 is modest (~10%, proportional to bandwidth increase). PP improvement is reportedly much larger (~3x from M4, driven by compute improvements), though we don't have standardized M4 PP numbers to compare directly.
Methodology
- Tool: `llama-bench` (3 repetitions, mean +/- std reported)
- Config: `-ngl 99 -fa 1` (full GPU offload, flash attention on)
- PP tests: 512, 2048, 8192 token prompts
- TG test: 128 token generation
- MLX: Custom Python benchmark (5 prompt types, 300 max tokens)
- Each model loaded fresh (cold start, no prompt caching)
- All GGUF from bartowski (imatrix quantizations) except DeepSeek (unsloth)
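Putting the methodology together, each GGUF row came from an invocation along these lines (model path is a placeholder):

```shell
# PP at 512/2048/8192, TG at 128, 3 repetitions, full GPU offload,
# flash attention on. Swap in the model being tested.
./llama-bench \
  -m models/Qwen3.5-27B-Q6_K.gguf \
  -p 512,2048,8192 \
  -n 128 \
  -ngl 99 -fa 1 \
  -r 3
```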
122B-A10B MoE Results
The community's most requested test. 122B parameters, 10B active per token, Q4_K_M quantization, 69GB on disk.
| Metric | 122B-A10B MoE (Q4_K_M) | 35B-A3B MoE (Q6_K) | 27B Dense (Q6_K) | 72B Dense (Q6_K) |
|---|---|---|---|---|
| PP 512 | 1,011 tok/s | 2,845 tok/s | 513 tok/s | 145 tok/s |
| PP 2048 | 926 tok/s | 2,265 tok/s | 410 tok/s | 140 tok/s |
| PP 8192 | 749 tok/s | 2,063 tok/s | 373 tok/s | — |
| TG 128 | 41.5 tok/s | 92.2 tok/s | 19.0 tok/s | 7.9 tok/s |
| Model size | 69.1 GiB | 28.0 GiB | 21.5 GiB | 59.9 GiB |
| Total params | 122B | 35B | 27B | 72B |
| Active params | 10B | 3B | 27B | 72B |
Key takeaway: A 122B model running at 41.5 tok/s on a laptop. That's faster than the dense 27B (19 tok/s) despite having 4.5x more total parameters. MoE + unified memory is the killer combination for Apple Silicon.
122B vs 72B dense: The 122B MoE is 5.3x faster at token generation (41.5 vs 7.9) and 7x faster at prompt processing (1,011 vs 145) than the 72B dense model, while being only 15% larger on disk (69 vs 60 GiB). And it benchmarks better on most tasks.
What's Next
- BF16 27B test (baseline quality reference)
- Context length scaling tests (8K → 32K → 128K)
- Concurrent request benchmarks
- MLX PP measurement (needs different tooling)
- Comparison with Strix Halo (community requested)
Date
2026-03-21
v1 post: r/LocalLLaMA — thanks for the feedback that made this v2 possible.