
M5 Max vs M3 Max Inference Benchmarks (Qwen3.5, oMLX, 128GB, 40 GPU cores)

Reddit r/LocalLLaMA / 3/28/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The post runs identical local inference benchmarks of three Qwen 3.5 models on M5 Max and M3 Max 16-inch MacBook Pros, each with 40 GPU cores and 128GB unified memory, using oMLX v0.2.23.

Ran identical benchmarks on both 16” MacBook Pros with 40 GPU cores and 128GB unified memory across three Qwen 3.5 models (122B-A10B MoE, 35B-A3B MoE, 27B dense) using oMLX v0.2.23.
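
The post doesn't show the oMLX invocation, so here's a rough stand-in harness using mlx-lm to approximate a pp1024/tg128 run; the model repo name is hypothetical, and time-to-first-token is used as a proxy for prefill time:

```python
# Rough stand-in: the post doesn't give the oMLX command line, so this uses
# mlx-lm (pip install mlx-lm) to approximate a pp1024/tg128 measurement.
import time
from mlx_lm import load, stream_generate

PROMPT_WORDS = 1024  # approximates pp1024 (word count != exact token count)
GEN_TOKENS = 128     # tg128

model, tokenizer = load("mlx-community/Qwen3.5-35B-A3B-4bit")  # hypothetical repo
prompt = "benchmark " * PROMPT_WORDS

start = time.perf_counter()
first = None
n = 0
for _ in stream_generate(model, tokenizer, prompt, max_tokens=GEN_TOKENS):
    if first is None:
        first = time.perf_counter()  # time to first token ~= prefill time
    n += 1
end = time.perf_counter()

print(f"pp: ~{PROMPT_WORDS / (first - start):.1f} tok/s, "
      f"tg: ~{(n - 1) / (end - first):.1f} tok/s")
```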

Quick numbers at pp1024/tg128 (M5 Max vs M3 Max):

  • 35B-A3B: 134.5 vs 80.3 tg tok/s (1.7x)
  • 122B-A10B: 65.3 vs 46.1 tg tok/s (1.4x)
  • 27B dense: 32.8 vs 23.0 tg tok/s (1.4x)

The gap widens at longer contexts. At 65K, the 27B dense drops to 6.8 tg tok/s on M3 Max vs 19.6 on M5 Max (2.9x). Prefill advantages are even larger, up to 4x at long context, driven by the M5 Max’s GPU Neural Accelerators.
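
A back-of-envelope check shows why prefill dominates at long context. The tg rates are from the post, but the pp rates below are placeholders I assumed, since the excerpt only gives the "up to 4x" ratio:

```python
# End-to-end time for one request = prompt_tokens/pp + gen_tokens/tg.
# tg rates are the post's 27B-dense numbers at 65K context; pp rates are
# assumed placeholders to illustrate how prefill dominates the total.
def request_seconds(prompt_toks: int, gen_toks: int,
                    pp_tps: float, tg_tps: float) -> float:
    return prompt_toks / pp_tps + gen_toks / tg_tps

for name, pp_tps, tg_tps in [("M5 Max", 400.0, 19.6), ("M3 Max", 100.0, 6.8)]:
    total = request_seconds(65_000, 128, pp_tps, tg_tps)
    print(f"{name}: {total:.0f} s total, {65_000 / pp_tps:.0f} s of it prefill")
```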

Batching matters most for agentic workloads. M5 Max scales to 2.54x throughput at 4x batch on the 35B-A3B, while batching on the M3 Max can actually lose throughput (0.80x at 2x batch on the 122B). The 614 GB/s vs 400 GB/s memory bandwidth gap is significant for multi-step agent loops or parallel tool calls.
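
To read those scaling factors: aggregate decode throughput is single-stream tg times the factor, and per-stream throughput divides that by batch size. Using the post's numbers:

```python
# Aggregate vs per-stream decode throughput under batching,
# using the scaling factors reported in the post.
def batched(tg_single: float, batch: int, scaling: float) -> tuple[float, float]:
    aggregate = tg_single * scaling
    return aggregate, aggregate / batch

agg, per = batched(134.5, 4, 2.54)  # M5 Max, 35B-A3B at 4x batch
print(f"M5 Max: {agg:.0f} tok/s aggregate, {per:.0f} tok/s per stream")

agg, per = batched(46.1, 2, 0.80)   # M3 Max, 122B-A10B at 2x batch
print(f"M3 Max: {agg:.0f} tok/s aggregate, {per:.0f} tok/s per stream")
```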

MoE efficiency is another takeaway. The 122B model (10B active) generates faster than the 27B dense on both machines. Decode speed tracks active parameter count, not total model size.
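
A rough bandwidth roofline (my sketch, not from the post) makes that concrete: each decoded token streams the active weights through the GPU once, so tok/s is capped near memory bandwidth divided by active-weight bytes. Assuming ~4-bit quantization at roughly 0.55 bytes/param including overhead:

```python
# Decode roofline: tok/s ceiling ~= bandwidth / bytes of active weights read
# per token. The 0.55 bytes/param figure is my assumption for ~4-bit quant
# with overhead; the post's measured 65.3 and 32.8 tok/s sit below these.
def roofline_tps(bandwidth_gb_s: float, active_params_b: float,
                 bytes_per_param: float = 0.55) -> float:
    # active_params_b is in billions, so active bytes = active_params_b * bpp GB
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

for name, active_b in [("122B-A10B (10B active)", 10.0), ("27B dense", 27.0)]:
    print(f"{name}: ~{roofline_tps(614, active_b):.0f} tok/s ceiling on M5 Max")
```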

Full interactive breakdown with all charts and data: https://claude.ai/public/artifacts/c9fba245-e734-4b3b-be44-a6cabdec6f8f

submitted by /u/onil_gova
