Follow-up to my last post comparing these two platforms. This time I am documenting what actually happened during the first week with both machines running simultaneously. To the people complaining that I am not doing a like-for-like comparison: these are not like-for-like products, so I am optimizing my deployment for each of them individually. This post goes into more detail about the results I got and how they changed my thinking.
The gap that tells you everything
The Mac Studio was serving Qwen3.5-397B inference four hours after I plugged it in. The DGX Sparks took four days. I hit five distinct categories of failure: ephemeral IPs that vanish on reboot, a stale container build that was three days old (ancient history on the bleeding edge), OOM crashes that required binary-searching memory allocation in 0.1GB increments, a recursive symlink that turned 1.9MB of config into 895MB on S3, and non-interactive sudo silently failing every automated step. Each one of those is its own war story. I have heard from others saying I was doing it wrong because they got stood up in an hour; to them I say congrats, and lucky you.
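The recursive symlink was the most avoidable one in hindsight. Here is a minimal sketch of the pre-sync check I wish I had run (the function name and approach are mine, not anything from the DGX tooling): it flags symlinks that resolve to one of their own ancestor directories, which is the pattern that balloons a naive recursive copy.

```python
import os

def find_symlink_loops(root):
    """Flag symlinks that resolve to an ancestor of themselves --
    the pattern that makes a naive recursive copy re-descend forever."""
    loops = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            here = os.path.realpath(dirpath)
            # If the link's target is this directory or any ancestor of
            # it, following the link loops back into the tree being copied.
            if here == target or here.startswith(target + os.sep):
                loops.append(path)
    return loops
```

Running something like this against a config directory before pushing it to S3 would have caught the 1.9MB-to-895MB blowup before the upload started.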
The benchmarks nobody expected
Generation speed is a tie. Both platforms deliver 27 to 29 tok/s across all context lengths on Qwen3.5-397B. You cannot tell the difference reading the output.
Prefill is where the Sparks dominate. 730 tok/s at 4K vs the Mac's 317. Blackwell's tensor cores eat large prompts like a little sampler plate at Applebee's. If you dump long conversations or documents into context, the Sparks feel noticeably snappier.
Here is the surprise: embedding throughput (Qwen3-Embedding-8B) went to the Mac Studio, 112 sentences/s vs the Spark's 76.6. Embedding is purely memory bandwidth bound, and the M3 Ultra's 819 GB/s crushes the 273 GB/s per Spark node. I expected CUDA to win this and it did not. That said, relooking at the numbers, the Mac's margin was smaller than I anticipated.
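A quick sanity check on the bandwidth-bound claim, using only the numbers above. If embedding throughput scaled linearly with memory bandwidth you would expect the Mac's margin to match the bandwidth ratio; it does not, which is consistent with the margin being smaller than anticipated:

```python
# Published bandwidth specs (GB/s) and measured throughput (sentences/s).
mac_bw, spark_bw = 819.0, 273.0
mac_tput, spark_tput = 112.0, 76.6

bw_ratio = mac_bw / spark_bw        # 3.0x raw bandwidth advantage
tput_ratio = mac_tput / spark_tput  # ~1.46x observed advantage

print(f"bandwidth ratio: {bw_ratio:.1f}x, throughput ratio: {tput_ratio:.2f}x")
```

The gap between 3.0x and roughly 1.46x suggests something besides bandwidth (compute, batching, framework overhead) caps the Mac's embedding throughput, even if bandwidth decides the winner.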
Why I did not use exo
I know people will ask. Four reasons: I run different quantizations on each platform (INT4 AutoRound vs 6-bit, and you cannot split inference across incompatible formats), the 397B MoE has unpredictable memory access patterns that do not split cleanly over a network link, combining the machines for inference would kill my ability to run background RAG jobs, and exo does not support INT4 AutoRound or MoE architectures well. The engineering is brilliant. It just solves a different problem than the one I was presented with.
The architecture I discovered
My original plan was to benchmark embedding throughput and return the loser. The Mac won embedding. By my own criteria the Sparks should have gone back.
But speed was not the real issue I was solving for. Isolation was. Running batch embedding on the Mac while it serves a 397B model introduces memory contention, thermal throttling, and inference degradation. The Sparks give me dedicated hardware for RAG (embedding, reranking, vector search, BM25) that never touches inference memory. Yes, I am killing a fly with a flamethrower, but I have the funds and bandwidth to support these devices.
Mac Studio = pure inference appliance, full 512GB for the model. Sparks = always on RAG engine running embedding and reranking in the background. Query comes in, Sparks retrieve and rerank, send chunks to the Mac, Mac generates at 29 tok/s. The architecture was not designed. It was discovered through failure.
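For the curious, the query flow above looks roughly like this as code. The hosts, ports, and endpoint shapes here are placeholders I made up for illustration; the only real parts are the division of labor and the model split.

```python
# Hypothetical sketch of the split pipeline: Spark nodes handle
# retrieval + rerank, the Mac serves Qwen3.5-397B for generation.
# Hosts and endpoints below are assumptions, not my actual config.
SPARK_RAG = "http://spark.local:8001"  # embedding, rerank, vector search, BM25
MAC_LLM = "http://mac.local:8000"      # pure inference appliance

def build_generation_request(query, chunks, model="qwen3.5-397b"):
    """Fold the Spark-reranked chunks into one chat request for the Mac."""
    context = "\n\n".join(c["text"] for c in chunks)
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    }
```

The key property is that nothing in the retrieval path ever allocates on the Mac: by the time a request reaches the inference box, it is just a prompt, and the full 512GB stays dedicated to the model.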
What is in the full writeup
The detailed failure narratives for all five categories above, the full benchmark tables across every context length, and the reasoning behind why the friction actually forced a better architecture than I would have designed on purpose.
Full article: https://open.substack.com/pub/alooftwaffle/p/96-hours-with-dual-dgx-sparks-and
Happy to answer questions. Last post generated some great discussion and I learned from it.