I recently saw an article about exo doing disaggregated prefill across a DGX Spark and an M3 Ultra - prefill on one machine, decode on the other. The DGX Spark apparently has ~4x the matmul performance of an M3 Ultra (about what the M5 Ultra is expected to have). So I got a Spark and have been playing around with it this weekend. Here are the results I've been getting with llama.cpp:
┌──────────────┬─────────────┬───────────────┬────────────┐
│ Model        │ Mac pp16384 │ Spark pp16384 │ Result     │
├──────────────┼─────────────┼───────────────┼────────────┤
│ Qwen 35B A3B │ 1574 t/s    │ 2198 t/s      │ Spark 1.4x │
│ Qwen 27B     │ 340 t/s     │ 778 t/s       │ Spark 2.3x │
│ Minimax M2.7 │ 372 t/s     │ 763 t/s       │ Spark 2.1x │
│ Mistral 128B │ 72 t/s      │ 241 t/s       │ Spark 3.4x │
└──────────────┴─────────────┴───────────────┴────────────┘

In the end I found exo a little overkill for this simple use case, so I've got Claude building a more focused and direct setup that just uses llama.cpp's KV-cache serialisation, plus some wrappers to handle shipping the cache between machines.
For anyone who's just got a Spark or is thinking of getting one: the most important thing I've found so far is to disable mmap in llama.cpp, otherwise it massively hurts both model loading time (many minutes vs. ~20 seconds) and even prefill speed.
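Concretely, that's `-mmp 0` in llama-bench and `--no-mmap` in llama-cli/llama-server (the model filename below is just a placeholder):

```shell
# llama-bench: mmap is on by default, disable it with -mmp 0
llama-bench -m model.gguf -p 16384 -mmp 0

# llama-server / llama-cli: use --no-mmap instead
llama-server -m model.gguf --no-mmap -ngl 99
```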
The Spark is tiny and low power. Good complement to the M3 Ultra for a neat, quiet package.
Of course the M3 Ultra has only ~66% of the memory bandwidth the M5 Ultra is expected to have, so decode speeds will be lower - but I'm already pretty happy with M3 decode, and the M5 Ultra definitely won't be enough of a boost for me to drop another $10k on it. My current setup now sits somewhere between an M5 Max and an M5 Ultra, but with CUDA capability.
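Back-of-the-envelope: decode is mostly memory-bandwidth-bound (every token reads roughly the whole set of active weights), so the decode speedup should track the bandwidth ratio roughly 1:1. Quick sanity check - the model size here is a made-up example, only the 819 GB/s M3 Ultra spec and the ~66% ratio come from above:

```python
# Decode ceiling estimate: t/s ≈ memory bandwidth / bytes read per token.
m3_ultra_bw = 819e9              # M3 Ultra: ~819 GB/s (Apple's spec)
m5_ultra_bw = m3_ultra_bw / 0.66 # implied by the ~66% figure
model_bytes = 17e9               # illustrative: a ~17 GB quant, fully read per token

print(f"M3 Ultra decode ceiling: ~{m3_ultra_bw / model_bytes:.0f} t/s")
print(f"M5 Ultra decode ceiling: ~{m5_ultra_bw / model_bytes:.0f} t/s")
```

So roughly a 1.5x decode bump - nice, but not $10k nice.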
If I upgraded anything just now, it would probably be adding a second Spark via the 200GbE!
I also wonder if I can get even better performance with vLLM, especially for batching - if anyone has good info on this, please post it here. I'll keep experimenting and keep you all posted if people are interested.




