RDMA Mac Studio cluster - performance questions beyond generation throughput

Reddit r/LocalLLaMA / 3/27/2026


Key Points

The post questions whether an RDMA Mac Studio cluster can deliver performance gains beyond reported generation throughput (e.g., 31.9 tok/s on 4 nodes for Qwen3 235B), focusing on prompt/prefill, latency, and other bottlenecks.
It asks specifically how prefill speed scales with context size (32K/64K/128K), and whether aggregate bandwidth across nodes helps or RDMA communication overhead cancels the benefit.
It highlights key operational concerns for real deployments, including time-to-first-token scaling, KV cache persistence across nodes between turns, and the impact of distributed vs single-node model loading (cold-start time for 200B+ models).
It seeks input on whether mixed hardware configurations (unequal RAM sizes and potentially mixed chip generations) introduce penalties that reduce the value of clustering.
It also probes whether generation throughput holds up over longer outputs (4K–8K tokens), and whether clustering meaningfully improves the user experience compared with a single-node setup such as one M3 Ultra 256GB unit.

Jeff Geerling’s RDMA cluster benchmarks showed great generation throughput (31.9 tok/s on 4 nodes for Qwen3 235B), but I have questions about other aspects of performance. If anyone has an RDMA cluster set up, I'd like to hear about:

Prefill speed - Prompt processing at 32K/64K/128K context, single node vs clustered. Does aggregate bandwidth help, or does RDMA overhead eat it?
Time to first token - Latency before output starts. How does it scale with the number of nodes?
KV cache - Does the cache persist across nodes between turns, or is the prompt re-prefilled on every query?
Model loading - Cold-start time for 200B+ models, single node vs distributed.
Mixed hardware - Any penalty from mismatched RAM (256GB + 512GB nodes)? What about mixed chip generations (M3 Ultra + a future M5 Ultra)?
Sustained generation - Does throughput hold for 4K-8K token outputs, or does it degrade?
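For anyone replying with numbers, here is a minimal sketch of one way to report prefill speed, time to first token and sustained generation in a comparable form. It assumes you can record a wall-clock timestamp when the request is sent and another as each generated token arrives. The names `summarize_run` and `windowed_decode_rates` are hypothetical and not part of any inference library. Prefill speed is approximated as prompt tokens divided by time to first token, so it also absorbs network and scheduling overhead.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RunMetrics:
    ttft_s: float         # time to first token, in seconds
    prefill_tok_s: float  # prompt tokens / TTFT (an approximation, see above)
    decode_tok_s: float   # tokens generated after the first / decode time


def summarize_run(request_start: float,
                  token_times: Sequence[float],
                  prompt_tokens: int) -> RunMetrics:
    """Derive per-run metrics from timestamps given in seconds.

    token_times[i] is the arrival time of the i-th generated token.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two generated tokens")
    ttft = token_times[0] - request_start
    decode_time = token_times[-1] - token_times[0]
    if ttft <= 0 or decode_time <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return RunMetrics(
        ttft_s=ttft,
        prefill_tok_s=prompt_tokens / ttft,
        decode_tok_s=(len(token_times) - 1) / decode_time,
    )


def windowed_decode_rates(token_times: Sequence[float],
                          window: int = 1024) -> List[float]:
    """Tokens per second over consecutive windows of `window` tokens.

    A falling sequence suggests throughput degrades over long outputs.
    """
    rates = []
    for start in range(0, len(token_times) - window, window):
        elapsed = token_times[start + window] - token_times[start]
        rates.append(window / elapsed)
    return rates
```

Running the same prompt at 32K/64K/128K on one node and on the cluster, then comparing `ttft_s` and the windowed rates, would cover the prefill, latency and sustained-generation questions with the same yardstick.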

I currently have an M3 Ultra 256GB on order and am trying to understand whether clustering is a real upgrade path.

Obviously, if you only have one data point, you don’t need to answer all six. I’m just casting a wide net.
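On the KV cache question, a rough sizing sketch may help frame how much memory it would take to keep the cache between turns at the context lengths listed above. It assumes a standard transformer with grouped-query attention and an unquantized 16-bit cache. The layer count, KV head count and head dimension below are placeholder values, not the configuration of Qwen3 235B or any other specific model, and `kv_cache_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   context_tokens: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence, in bytes.

    The leading factor of 2 accounts for one key tensor and one value
    tensor per layer. bytes_per_value=2 corresponds to fp16/bf16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_value)


if __name__ == "__main__":
    # Placeholder architecture values, chosen only for illustration.
    for ctx in (32_768, 65_536, 131_072):
        size = kv_cache_bytes(num_layers=96, num_kv_heads=8,
                              head_dim=128, context_tokens=ctx)
        print(f"{ctx:>7} tokens: {size / GIB:.1f} GiB")
```

With these placeholder values the cache comes to 12, 24 and 48 GiB at 32K, 64K and 128K tokens. In a cluster that splits layers across nodes, each node would hold only the share for its own layers, so whether the cache survives between turns depends on the serving software rather than on this arithmetic.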

submitted by /u/quietsubstrate