[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE

Reddit r/LocalLLaMA / 4/8/2026


Key Points

  • The benchmark demonstrates that llama.cpp RPC can pool VRAM across two RTX 5090 workstations (64GB total) to run models that do not fit on a single 32GB GPU at the tested quantization levels.
  • For Qwen 3.5 27B and Qwen 2.5 32B (Q6_K), dual-GPU RPC shows relatively stable scaling with only modest overhead versus single-GPU throughput.
  • The Qwen 3.5 35B MoE benchmark highlights an interconnect bottleneck, where dual-GPU throughput is below what might be expected from simple scaling.
  • Larger MoE targets (Qwen 3.5 122B) still hit memory-limit failures on single GPUs, but are able to run on the distributed setup (reported as “Beast Mode ON”), indicating practical viability for very large models.
  • The test environment uses llama.cpp (build 8709 / commit 85d482e6b), runs llama-bench with specified parameters, and relies on 2.5GbE LAN performance as the primary distribution constraint.
| Model | Size | Single 5090 (t/s) | Dual 5090 RPC (t/s) | Note |
|-------|------|-------------------|---------------------|------|
| Qwen3.5-27B (Q6_K) | 20.9 GB | 59.83 | 55.41 | -7% Overhead |
| Qwen3.5-35B MoE (Q6_K) | 26.8 GB | 206.76 | 150.99 | Interconnect Bottleneck |
| Qwen2.5-32B (Q6_K) | 25.0 GB | 54.69 | 51.47 | Stable Scaling |
| Qwen2.5-72B (Q4_K_M) | 40.9 GB | FAILED (OOM) | 32.74 | Now Playable! |
| Qwen3.5-122B MoE (IQ4_XS) | 56.1 GB | FAILED (OOM) | 96.29 | Beast Mode ON |

The Setup

I recently tested the distributed inference capabilities of llama.cpp RPC using two identical workstations. Pooling their VRAM (64GB total) makes it possible to run models that simply cannot fit on a single 32GB card.

  • GPUs: 2x NVIDIA GeForce RTX 5090 (32GB VRAM each)
  • Interconnect: 2.5GbE LAN
  • OS: Ubuntu 24.04
  • Software: llama.cpp (Build 8709 / Commit 85d482e6b)
  • Method: llama-bench with ngl 99, fa 1, b 512, p 2048, n 256
  • Breaking the VRAM Barrier: The most significant result is the ability to run Qwen 2.5 72B and Qwen 3.5 122B. These models simply won't load on a single 32GB card at these quant levels. RPC effectively turns two machines into a 64GB unified AI workstation.
  • MoE Performance is King: The Qwen 3.5 122B MoE is the star of the show, hitting 96.29 tokens/sec. Even with the network latency of a distributed setup, MoE's sparse activation makes it incredibly viable for real-time use.
  • The 2.5GbE Bottleneck: For smaller, high-speed models like the 35B MoE, we see a 27% performance drop (206 -> 150 t/s) when moving to RPC. The 2.5GbE link is the bottleneck here. For the larger 72B/122B models, the computation time outweighs the transfer time, making the trade-off very worth it.
  • Prompt Processing (PP): On a single 5090, Qwen 3.5 35B hits 6190 t/s in prefill. Over RPC, this drops to 2823 t/s. The raw prefill power of Blackwell is insane, but it's heavily throttled by network bandwidth in distributed mode.
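As a sanity check on the overhead figures above, the percentage drops can be recomputed directly from the table's single vs. dual t/s values (a throwaway awk one-liner, nothing llama.cpp-specific):

```shell
# Recompute RPC overhead as (1 - dual/single) * 100, using the t/s values from the table.
awk 'BEGIN {
  printf "Qwen3.5-27B:     %.1f%% drop\n", (1 - 55.41/59.83)   * 100
  printf "Qwen3.5-35B MoE: %.1f%% drop\n", (1 - 150.99/206.76) * 100
  printf "Qwen2.5-32B:     %.1f%% drop\n", (1 - 51.47/54.69)   * 100
}'
# → 7.4%, 27.0%, 5.9%
```

This matches the ~7% overhead noted for the 27B dense model and the 27% drop attributed to the interconnect bottleneck on the 35B MoE.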

Benchmark Command

```shell
./llama-bench -m [model] -ngl 99 -fa 1 -p 2048 -n 256 -b 512 --rpc 192.168.X.X:50052
```
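For anyone reproducing this, the RPC workflow has two sides: the remote workstation runs llama.cpp's `rpc-server` to expose its GPU, and the client points `llama-bench` at it. A minimal sketch, assuming a CUDA build and the default llama.cpp CMake layout (flag spellings follow the llama.cpp RPC docs and may differ across builds):

```shell
# On the remote workstation: build llama.cpp with RPC + CUDA support,
# then expose its GPU over the LAN (trusted networks only -- RPC is unauthenticated).
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON
cmake --build build --config Release
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# On the local workstation: run the benchmark, offloading across both GPUs.
./llama-bench -m [model] -ngl 99 -fa 1 -p 2048 -n 256 -b 512 --rpc 192.168.X.X:50052
```

Multiple remote hosts can be chained by passing a comma-separated list to `--rpc`.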

Conclusion

If you have two high-end GPUs in separate rigs, llama.cpp RPC is now mature enough to be a daily driver. It allows you to trade a bit of speed for the ability to run massive models that were previously reserved for professional H100/A100 clusters. Running a 122B model at nearly 100 t/s at home feels like the future.

https://preview.redd.it/f86vr9rdrytg1.png?width=2692&format=png&auto=webp&s=304b19a5bc34d44790519e67b9eb378394a071ca

submitted by /u/ReasonableDuty5319