The Setup

I recently tested the distributed inference capabilities of llama.cpp's RPC backend using two identical workstations, each with an RTX 5090. This setup pools the two cards' VRAM (64GB total) to run models that cannot fit on a single 32GB card.
Benchmark Command

Conclusion

If you have two high-end GPUs in separate rigs, llama.cpp RPC is now mature enough to be a daily driver. You trade a bit of speed for the ability to run massive models that were previously reserved for professional H100/A100 clusters. Running a 122B model at nearly 100 t/s at home feels like the future.
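The exact benchmark command is not reproduced above. As a sketch of what such a distributed run looks like with llama.cpp's RPC backend: the model path, worker address, and flag values below are illustrative assumptions, not the poster's actual invocation.

```shell
# On the remote workstation: build llama.cpp with the RPC backend enabled
# and expose its GPU over the LAN (50052 is the conventional rpc-server port).
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON
cmake --build build --config Release
./build/bin/rpc-server -p 50052

# On the local workstation: point llama-bench at the remote worker so the
# model's layers are split across both GPUs. Paths and IP are hypothetical.
./build/bin/llama-bench \
  -m models/qwen3.5-122b-moe-q4_k_m.gguf \
  -ngl 99 \
  --rpc 192.168.1.2:50052
```

The same `--rpc host:port` argument works with `llama-cli` and `llama-server`, so a setup validated with `llama-bench` carries over directly to interactive use.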
[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE
Reddit r/LocalLLaMA / 4/8/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The benchmark demonstrates that llama.cpp RPC can pool VRAM across two RTX 5090 workstations (64GB total) to run models that do not fit on a single 32GB GPU at the tested quantization levels.
- For Qwen 3.5 27B and Qwen 2.5 32B (Q6_K), dual-GPU RPC shows relatively stable scaling with only modest overhead versus single-GPU throughput.
- The Qwen 3.5 35B MoE benchmark highlights an interconnect bottleneck, where dual-GPU throughput is below what might be expected from simple scaling.
- Larger MoE targets (Qwen 3.5 122B) still hit memory-limit failures on single GPUs, but are able to run on the distributed setup (reported as “Beast Mode ON”), indicating practical viability for very large models.
- The test environment uses llama.cpp (build 8709 / commit 85d482e6b), runs llama-bench with specified parameters, and relies on 2.5GbE LAN performance as the primary distribution constraint.
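To see why the 2.5GbE link is the constraint that matters, a back-of-envelope estimate helps. The hidden size, activation precision, and LAN round-trip time below are assumptions for illustration, not figures from the benchmark.

```python
# Rough model of the per-token network cost when a model is split across
# two machines over 2.5GbE. All sizes here are illustrative assumptions.

LINK_GBPS = 2.5                          # 2.5GbE line rate
LINK_BYTES_PER_S = LINK_GBPS * 1e9 / 8   # ~312.5 MB/s

HIDDEN_DIM = 8192        # hypothetical hidden size at the split boundary
BYTES_PER_ELEM = 2       # fp16 activations

# During token-by-token generation, roughly one hidden-state vector
# crosses the link per generated token per split boundary.
payload_bytes = HIDDEN_DIM * BYTES_PER_ELEM
transfer_s = payload_bytes / LINK_BYTES_PER_S
rtt_s = 0.2e-3           # assumed LAN round-trip latency

per_token_overhead_s = transfer_s + rtt_s
print(f"payload per token: {payload_bytes / 1024:.1f} KiB")
print(f"network overhead per token: {per_token_overhead_s * 1e3:.3f} ms")
print(f"share of a 96 t/s token budget: {per_token_overhead_s * 96:.1%}")
```

Under these assumptions the wire cost per generated token is a small slice of the ~10 ms budget at 96 t/s, which suggests the visible slowdown comes less from raw bandwidth during generation and more from latency-sensitive steps such as batched prompt processing and MoE expert routing, where far more data crosses the link per step.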



