I spent 96 hours setting up dual DGX Sparks and a Mac Studio M3 Ultra for the same 397B model. Neither won.

Reddit r/LocalLLaMA / 3/28/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author reports that a Mac Studio M3 Ultra began serving Qwen3.5-397B inference four hours after setup, while dual DGX Sparks only became ready after four days due to multiple operational failures (network/IP volatility, stale container build, OOM crashes, runaway config recursion, and non-interactive sudo issues).
  • In generation speed, both platforms are effectively tied at about 27–29 tokens/second across context lengths for Qwen3.5-397B, producing indistinguishable output for the author’s tests.
  • For prefill (long-prompt handling), the DGX Sparks are significantly faster (730 tok/s at 4K vs 317 tok/s on the Mac), making them feel more responsive for long-context workloads.
  • Embedding throughput surprised the author: the Mac Studio outperformed the Sparks (112 sentences/s vs 76.6/s) because embedding is memory-bandwidth bound and the M3 Ultra’s bandwidth advantage outweighed the expected CUDA benefits.
  • The author avoided using “exo” due to incompatible quantization formats, the MoE model’s unpredictable memory access over networks, and practical workflow constraints around running background RAG alongside inference.

Follow-up to my last post comparing these two platforms. This time I am documenting what actually happened during the first week with both machines running simultaneously. To the people complaining that this is not a like-for-like comparison: these are not like-for-like products, so I optimized the deployment for each of them individually. This post goes into more detail about the results I got and how they changed my thinking.

The gap that tells you everything

The Mac Studio was serving Qwen3.5-397B inference four hours after I plugged it in. The DGX Sparks took four days. I hit five distinct categories of failure: ephemeral IPs that vanish on reboot, a stale container build that was three days old (ancient history on the bleeding edge), OOM crashes that required binary-searching memory allocation in 0.1GB increments, a recursive symlink that turned 1.9MB of config into 895MB on S3, and non-interactive sudo silently failing every automated step. Each one of those is its own war story. Some people have told me I was doing it wrong because they got stood up in an hour; to them I say congrats, and lucky you.
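The symlink failure is the easiest of the five to guard against up front. A minimal sketch of a pre-upload check (my own illustration, not from the writeup): walk the tree without following links and flag any symlink that resolves to one of its own ancestors, since following such a link during an archive or sync re-copies the same tree over and over, which is exactly how a small config directory can balloon on S3.

```python
import os

def find_cyclic_symlinks(root):
    """Return symlinks under root whose target resolves to an ancestor
    of the link itself. Following such a link while copying/uploading
    duplicates the same subtree repeatedly."""
    cyclic = []
    for dirpath, dirnames, filenames in os.walk(root):  # does not follow links
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.path.realpath(path)
                link_dir = os.path.realpath(dirpath)
                # A link resolving to its own containing directory (or any
                # ancestor of it) creates a copy loop.
                if link_dir == target or link_dir.startswith(target + os.sep):
                    cyclic.append(path)
    return cyclic
```

Running this before any sync job would have caught the loop while it was still 1.9MB.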

The benchmarks nobody expected

Generation speed is a tie. Both platforms deliver 27 to 29 tok/s across all context lengths on Qwen3.5-397B. You cannot tell the difference reading the output.

Prefill is where the Sparks dominate. 730 tok/s at 4K vs the Mac's 317. Blackwell's tensor cores eat large prompts like a little sampler plate at Applebee's. If you dump long conversations or documents into context, the Sparks feel noticeably snappier.

Here is the surprise: embedding throughput (Qwen3-Embedding-8B) went to the Mac Studio. 112 sentences/s vs the Spark's 76.6. Embedding is purely memory bandwidth bound. The M3 Ultra's 819 GB/s crushes 273 GB/s per Spark node. I expected CUDA to win this and it did not. That said, revisiting the numbers, the Mac's margin was narrower than the raw bandwidth gap would suggest.
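A quick back-of-envelope on those figures (numbers straight from the post, the arithmetic is mine) shows what I mean about the margin:

```python
# Raw memory-bandwidth advantage vs measured embedding advantage.
mac_bw_gbs, spark_bw_gbs = 819, 273    # reported memory bandwidth, GB/s
mac_sps, spark_sps = 112, 76.6         # measured embedding sentences/s

bw_ratio = mac_bw_gbs / spark_bw_gbs   # raw bandwidth ratio
measured_ratio = mac_sps / spark_sps   # observed throughput ratio

print(f"bandwidth ratio: {bw_ratio:.2f}x")      # 3.00x
print(f"measured ratio:  {measured_ratio:.2f}x")  # ~1.46x
```

If embedding were perfectly bandwidth bound you would expect something closer to 3x; the observed ~1.46x suggests the Spark side is clawing back some ground elsewhere (batching, compute), which is consistent with the result being closer than the bandwidth gap implies.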

Why I did not use exo

I know people will ask. Four reasons: I run different quantizations on each platform (INT4 AutoRound vs 6-bit, and you cannot split inference across incompatible formats), the 397B MoE has unpredictable memory access patterns that do not split cleanly over a network link, combining them for inference would kill my ability to run background RAG jobs, and exo does not support INT4 AutoRound or MoE architectures well. The engineering is brilliant. It just solves a different problem than the one I was facing.

The architecture I discovered

My original plan was to benchmark embedding throughput and return the loser. The Mac won embedding. By my own criteria the Sparks should have gone back.

But speed was not the real issue I was solving for. Isolation was. Running batch embedding on the Mac while it serves a 397B model introduces memory contention, thermal throttling, and inference degradation. The Sparks give me dedicated hardware for RAG (embedding, reranking, vector search, BM25) that never touches inference memory. Yes, I am killing a fly with a flamethrower, but I have the funds and bandwidth to support these devices.

Mac Studio = pure inference appliance, full 512GB for the model. Sparks = always on RAG engine running embedding and reranking in the background. Query comes in, Sparks retrieve and rerank, send chunks to the Mac, Mac generates at 29 tok/s. The architecture was not designed. It was discovered through failure.
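The query flow above reduces to a simple composition. This is only a sketch of the split (the serving stack and function names are my own illustration, not from the post): retrieval and reranking run on the Sparks, and the Mac only ever sees the final assembled prompt.

```python
def answer(query, retrieve, rerank, generate, top_k=5):
    """Two-box RAG split:
    retrieve/rerank run on the Sparks, generate runs on the Mac Studio."""
    candidates = retrieve(query)                # vector search + BM25 on Sparks
    chunks = rerank(query, candidates)[:top_k]  # rerank on Sparks, keep top_k
    prompt = "\n\n".join(chunks) + "\n\nQuestion: " + query
    return generate(prompt)                     # 397B generation on the Mac
```

The point of the shape is that `generate` never competes with the RAG stages for memory or thermals; each callable can be a thin HTTP client to its own box.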

What is in the full writeup

The detailed failure narratives for all five categories above, the full benchmark tables across every context length, and the reasoning for why the friction actually forced a better architecture than I would have designed on purpose.

Full article: https://open.substack.com/pub/alooftwaffle/p/96-hours-with-dual-dgx-sparks-and

Happy to answer questions. Last post generated some great discussion and I learned from it.

submitted by /u/trevorbg