M5 Max Actual Pre-fill performance gains

Reddit r/LocalLLaMA / 3/24/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisTools & Practical Usage

共有:

Key Points

The article discusses why Apple’s claimed “over 4x peak GPU AI compute” for M5 Pro/M5 Max may reflect short, power-bursty performance rather than sustained throughput.
It suggests that both AI accelerator behavior and increased power/thermal headroom contribute to the visible peak gains, making results strongest for short prompts.
Based on further user testing, the postulated performance “sweet spot” occurs around ~16K tokens, aligning with Apple’s own footnote testing conditions.
The cited test setup measures time-to-first-token using a 14B-parameter model (4-bit weights, FP16 activations) on different MacBook Pro generations with MLX/ mlx-lm, emphasizing prefill behavior.
The discussion notes that any speed advantages may taper off for longer prompts as the workload extends beyond the initial high-power window.

M5 Max Actual Pre-fill performance gains

I think I figured out why apple says 4x the peak GPU AI compute. It's because they load it with a bunch of power for a few seconds. So it looks like half the performance comes from AI accelerators and the other half from dumping more watts in (or the AI accelerators use more watts).

Press release:
"With a Neural Accelerator in each GPU core and higher unified memory bandwidth, M5 Pro and M5 Max are over 4x the peak GPU compute for AI compared to the previous generation."

This is good for short bursty prompts but longer ones I imagine the speed gains diminish.

After doing more tests the sweet spot is around 16K tokens, coincidentally that is what apple tested in the footnotes:

Testing conducted by Apple in January and February 2026 using preproduction 16-inch MacBook Pro systems with Apple M5 Max, 18-core CPU, 40-core GPU and 128GB of unified memory, as well as production 16-inch MacBook Pro systems with Apple M4 Max, 16-core CPU, 40-core GPU and 128GB of unified memory, and production 16-inch MacBook Pro systems with Apple M1 Max, 10-core CPU, 32-core GPU and 64GB of unified memory, all configured with 8TB SSD. Time to first token measured with a 16K-token prompt using a 14-billion parameter model with 4-bit weights and FP16 activations, mlx-lm and MLX framework. Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Pro.

I did some thermal testing with 10 second cool down in between inference just for kicks as well.

submitted by /u/M5_Maxxx
[link] [comments]

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 3/24DailyView insight →

Interactive Web Visualization of GPT-2

Reddit r/artificial

Stop Treating AI Interview Fraud Like a Proctoring Problem

Dev.to

[R] Causal self-attention as a probabilistic model over embeddings

Reddit r/MachineLearning

The 5 software development trends that actually matter in 2026 (and what they mean for your startup)

Dev.to

InVideo AI Review: Fast Finished

Dev.to

M5 Max Actual Pre-fill performance gains

Key Points

💡 Insights using this article

Related Articles

Interactive Web Visualization of GPT-2

Stop Treating AI Interview Fraud Like a Proctoring Problem

[R] Causal self-attention as a probabilistic model over embeddings

The 5 software development trends that actually matter in 2026 (and what they mean for your startup)

InVideo AI Review: Fast Finished

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer