M5 Max Actual Pre-fill performance gains

Reddit r/LocalLLaMA / 3/24/2026

Key Points

  • The article discusses why Apple’s claimed “over 4x peak GPU AI compute” for M5 Pro/M5 Max may reflect short, power-bursty performance rather than sustained throughput.
  • It suggests that both AI accelerator behavior and increased power/thermal headroom contribute to the visible peak gains, making results strongest for short prompts.
  • Based on further user testing, the postulated performance “sweet spot” occurs around ~16K tokens, aligning with Apple’s own footnote testing conditions.
  • The cited test setup measures time-to-first-token using a 14B-parameter model (4-bit weights, FP16 activations) on different MacBook Pro generations with MLX and mlx-lm, emphasizing prefill behavior.
  • The discussion notes that any speed advantages may taper off for longer prompts as the workload extends beyond the initial high-power window.

I think I figured out why Apple claims over 4x the peak GPU compute for AI: they feed the chip a lot of power for a few seconds. So it looks like roughly half of the gain comes from the AI accelerators and the other half from dumping in more watts (or the AI accelerators themselves draw more watts).

Press release:
"With a Neural Accelerator in each GPU core and higher unified memory bandwidth, M5 Pro and M5 Max are over 4x the peak GPU compute for AI compared to the previous generation."

This is good for short, bursty prompts, but I imagine the speed gains diminish for longer ones.
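
To make the burst hypothesis concrete, here is a toy model in Python. It is only a sketch of the reasoning, not a measurement: it assumes the chip runs at a fixed peak rate for a fixed burst window and then drops to a lower sustained rate. The function names and every number used with them are hypothetical.

```python
def time_to_process(work, peak_rate, sustained_rate, burst_seconds):
    """Seconds needed to finish `work` units (e.g. prompt tokens).

    Assumption: the chip runs at `peak_rate` for the first `burst_seconds`,
    then at `sustained_rate` for whatever work is left.
    """
    burst_work = peak_rate * burst_seconds  # work that fits in the burst window
    if work <= burst_work:
        return work / peak_rate
    return burst_seconds + (work - burst_work) / sustained_rate


def effective_speedup(work, baseline_rate, peak_rate, sustained_rate, burst_seconds):
    """Speedup over a baseline chip that runs at a constant `baseline_rate`."""
    baseline_time = work / baseline_rate
    return baseline_time / time_to_process(work, peak_rate, sustained_rate, burst_seconds)
```

With placeholder values (baseline rate 1, peak rate 4, sustained rate 2, 5-second burst), `effective_speedup(10, 1.0, 4.0, 2.0, 5.0)` gives the full 4x because the job finishes inside the burst window, while `effective_speedup(100, 1.0, 4.0, 2.0, 5.0)` comes out a little over 2x. Under this model the headline figure only holds for work that fits in the burst.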

After more testing, the sweet spot is around 16K tokens. Coincidentally, that is what Apple tested, per the footnotes:

  1. Testing conducted by Apple in January and February 2026 using preproduction 16-inch MacBook Pro systems with Apple M5 Max, 18-core CPU, 40-core GPU and 128GB of unified memory, as well as production 16-inch MacBook Pro systems with Apple M4 Max, 16-core CPU, 40-core GPU and 128GB of unified memory, and production 16-inch MacBook Pro systems with Apple M1 Max, 10-core CPU, 32-core GPU and 64GB of unified memory, all configured with 8TB SSD. Time to first token measured with a 16K-token prompt using a 14-billion parameter model with 4-bit weights and FP16 activations, mlx-lm and MLX framework. Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Pro.
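
The footnote's metric, time to first token (TTFT) on a fixed-length prompt, can be turned into an approximate prefill throughput and a generation-over-generation speedup. A minimal Python sketch follows. It assumes "16K" means 16,384 tokens, and it treats TTFT as pure prefill time, which slightly overstates prefill cost because TTFT also includes producing the first token. The helper names are hypothetical.

```python
def prefill_tokens_per_second(prompt_tokens, ttft_seconds):
    """Approximate prefill throughput, treating TTFT as pure prefill time."""
    if ttft_seconds <= 0:
        raise ValueError("ttft_seconds must be positive")
    return prompt_tokens / ttft_seconds


def ttft_speedup(baseline_ttft_seconds, new_ttft_seconds):
    """How many times faster the new system reaches its first token."""
    if new_ttft_seconds <= 0:
        raise ValueError("new_ttft_seconds must be positive")
    return baseline_ttft_seconds / new_ttft_seconds


PROMPT_TOKENS = 16 * 1024  # assumption: "16K" read as 16,384 tokens
```

For example, with placeholder timings (not measurements), a TTFT of 8.0 s on a 16,384-token prompt works out to 2,048 tokens per second, and going from 8.0 s to 2.0 s is a 4x speedup.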

Just for kicks, I also did some thermal testing with a 10-second cooldown between inference runs.
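
A cooldown-spaced test like this can be sketched as a small Python timing harness. This is not the script used for the post. `run_once` is a hypothetical callable standing in for a single inference run (for example, one prefill through mlx-lm), and the 10-second default mirrors the cooldown mentioned above.

```python
import time


def timed_runs(run_once, repeats, cooldown_seconds=10.0,
               sleep=time.sleep, clock=time.perf_counter):
    """Time `repeats` calls to `run_once`, pausing between runs.

    `run_once` is a placeholder for whatever triggers one inference run.
    `sleep` and `clock` are injectable so the harness can be checked
    without actually waiting.
    """
    durations = []
    for i in range(repeats):
        start = clock()
        run_once()
        durations.append(clock() - start)
        if i < repeats - 1:
            # Let the chip cool before the next run; no pause after the last one.
            sleep(cooldown_seconds)
    return durations
```

Comparing the first duration against later ones, with and without the cooldown, shows whether back-to-back runs slow down as the chip heats up.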

submitted by /u/M5_Maxxx