RDMA Mac Studio cluster - performance questions beyond generation throughput

Reddit r/LocalLLaMA / 3/27/2026


Key Points

  • The post questions whether an RDMA Mac Studio cluster can deliver performance gains beyond reported generation throughput (e.g., 31.9 tok/s on 4 nodes for Qwen3 235B), focusing on prompt/prefill, latency, and other bottlenecks.
  • It asks specifically how prefill speed scales with context sizes (32K/64K/128K) and whether aggregated bandwidth helps or RDMA communication overhead offsets benefits.
  • It highlights key operational concerns for real deployments, including time-to-first-token scaling, KV cache persistence across nodes between turns, and the impact of distributed vs single-node model loading (cold-start time for 200B+ models).
  • It seeks input on whether mixed hardware configurations (unequal RAM sizes and potentially mixed chip generations) introduce penalties that reduce the value of clustering.
  • It also probes sustained generation behavior for longer outputs (4K–8K tokens), and whether clustering meaningfully upgrades user experience compared with single-node setups like an M3 Ultra 256GB unit.

Jeff Geerling’s RDMA cluster benchmarks showed great generation throughput (31.9 tok/s on 4 nodes for Qwen3 235B), but I have questions about other performance aspects. Anyone with an RDMA cluster setup:

  1. Prefill speed - Prompt processing at 32K/64K/128K context. Single node vs clustered. Does aggregate bandwidth help or does RDMA overhead eat it?

  2. Time to first token - Latency before output starts. How does it scale with nodes?

  3. KV cache - Does cache persist across nodes between turns? Or re-prefill every query?

  4. Model loading - Cold-start time for 200B+ models. Single vs distributed.

  5. Mixed hardware - Any penalty from mismatched RAM (256GB + 512GB nodes)? What about mixed chip generations (M3 Ultra + future M5 Ultra)?

  6. Sustained generation - Does throughput hold for 4K-8K token outputs or degrade?
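To keep replies comparable, here is a minimal sketch of how questions 1, 2, and 6 could be reduced to numbers from a single streamed run. It assumes you can record a wall-clock timestamp when the request is sent and one per generated token (for example with `time.perf_counter()`). The names `summarize_run` and `RunMetrics` and the 512-token window are hypothetical and not part of any cluster tooling. Prefill speed is approximated as prompt tokens divided by time to first token, which also counts any network or scheduling latency.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunMetrics:
    ttft_s: float              # time to first token (question 2)
    prefill_tok_s: float       # prompt tokens / TTFT, an approximation (question 1)
    decode_tok_s: float        # average generation rate after the first token
    window_tok_s: List[float]  # rate per window, to spot degradation (question 6)


def summarize_run(request_sent: float,
                  token_times: List[float],
                  prompt_tokens: int,
                  window: int = 512) -> RunMetrics:
    """Summarize one streamed generation from per-token arrival timestamps.

    All timestamps are in seconds on the same clock. This is a hypothetical
    helper for illustration, not an API of any inference framework.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two generated tokens")

    ttft = token_times[0] - request_sent
    if ttft <= 0:
        raise ValueError("first token must arrive after the request was sent")

    # Average decode rate: intervals between tokens, excluding prefill time.
    decode_elapsed = token_times[-1] - token_times[0]
    decode_rate = (len(token_times) - 1) / decode_elapsed

    # Rate over consecutive fixed-size windows; a falling series across
    # a 4K-8K token output would indicate throughput degradation.
    windows = []
    for start in range(0, len(token_times) - window, window):
        span = token_times[start + window] - token_times[start]
        windows.append(window / span)

    return RunMetrics(ttft, prompt_tokens / ttft, decode_rate, windows)
```

Running this once per context size (32K/64K/128K) and once per node count would give directly comparable single-node vs clustered figures.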

Currently have M3 Ultra 256GB on order, trying to understand if clustering is a real upgrade path.

Obviously, if you only have a single data point, you don't need to answer all six. I'm just casting a wide net.

submitted by /u/quietsubstrate