Jeff Geerling’s RDMA cluster benchmarks showed great generation throughput (31.9 tok/s on 4 nodes for Qwen3 235B), but I have questions about other performance aspects. Anyone with an RDMA cluster setup:
Prefill speed - Prompt processing at 32K/64K/128K context. Single node vs clustered. Does aggregate bandwidth help or does RDMA overhead eat it?
Time to first token - Latency before output starts. How does it scale with nodes?
KV cache - Does cache persist across nodes between turns? Or re-prefill every query?
Model loading - Cold-start time for 200B+ models. Single vs distributed.
Mixed hardware - Any penalty from mismatched RAM (256GB + 512GB nodes)? What about mixed chip generations (M3 Ultra + future M5 Ultra)?
Sustained generation - Does throughput hold for 4K-8K token outputs or degrade?
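If it helps anyone reporting numbers, here is a rough sketch of how I'd time the first, second, and last items so results are comparable. It is not tied to any particular cluster tool: `measure_stream` and its parameters are names I made up, and it assumes you can get a lazy iterator that yields one item per generated token and that you know the prompt's token count. Prefill speed is approximated as prompt tokens divided by time to first token, which also includes any request and network overhead.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    stream: Iterable[str],
    prompt_tokens: int,
    window: int = 1024,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a token stream (hypothetical helper, one yielded item == one token).

    Reports time to first token, an approximate prefill rate, and the
    decode rate for each full window of `window` generated tokens, so a
    slowdown over a long output shows up as falling window rates.
    """
    start = clock()
    ttft = None          # seconds until the first token arrived
    window_start = None  # timestamp where the current window began
    in_window = 0        # tokens counted in the current window
    total = 0
    window_rates = []

    for _ in stream:
        now = clock()
        total += 1
        if ttft is None:
            # First token: everything before this is prefill plus overhead.
            ttft = now - start
            window_start = now
            continue
        in_window += 1
        if in_window == window:
            window_rates.append(window / (now - window_start))
            window_start = now
            in_window = 0

    prefill_rate = prompt_tokens / ttft if ttft else None
    return {
        "ttft_s": ttft,
        "prefill_tok_s": prefill_rate,
        "decode_tok_s_by_window": window_rates,
        "generated_tokens": total,
    }
```

Running the same prompt at 32K/64K/128K on one node and then on the cluster, and comparing `ttft_s` and the per-window decode rates, would answer most of the above.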
I currently have an M3 Ultra 256GB on order and am trying to understand whether clustering is a real upgrade path.
Obviously, if you only have one data point, you don’t need to answer all six. I’m just casting a wide net.
