The Setup

I recently tested the distributed inference capabilities of llama.cpp's RPC backend using two identical workstations, each with an RTX 5090. This setup pools the two cards' VRAM (64GB total) to run models that cannot fit on a single 32GB card.
Benchmark Command

Conclusion

If you have two high-end GPUs in separate rigs, llama.cpp RPC is now mature enough to be a daily driver. You trade a bit of speed for the ability to run massive models that were previously reserved for professional H100/A100 clusters. Running a 122B model at nearly 100 t/s at home feels like the future.
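The exact benchmark command is not reproduced above. As a sketch of what such a distributed run looks like with llama.cpp's RPC backend: the model path, worker address, and flag values below are illustrative assumptions, not the poster's actual invocation.

```shell
# On the remote workstation: build llama.cpp with the RPC backend enabled
# and expose its GPU over the LAN (50052 is the conventional rpc-server port).
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON
cmake --build build --config Release
./build/bin/rpc-server -p 50052

# On the local workstation: point llama-bench at the remote worker so the
# model's layers are split across both GPUs. Paths and IP are hypothetical.
./build/bin/llama-bench \
  -m models/qwen3.5-122b-moe-q4_k_m.gguf \
  -ngl 99 \
  --rpc 192.168.1.2:50052
```

The same `--rpc host:port` argument works with `llama-cli` and `llama-server`, so a setup validated with `llama-bench` carries over directly to interactive use.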
[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE
Reddit r/LocalLLaMA / 4/8/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The benchmark demonstrates that llama.cpp RPC can pool VRAM across two RTX 5090 workstations (64GB total) to run models that do not fit on a single 32GB GPU at the tested quantization levels.
- For Qwen 3.5 27B and Qwen 2.5 32B (Q6_K), dual-GPU RPC shows relatively stable scaling with only modest overhead versus single-GPU throughput.
- The Qwen 3.5 35B MoE benchmark highlights an interconnect bottleneck, where dual-GPU throughput is below what might be expected from simple scaling.
- Larger MoE targets (Qwen 3.5 122B) still hit memory-limit failures on single GPUs, but are able to run on the distributed setup (reported as “Beast Mode ON”), indicating practical viability for very large models.
- The test environment uses llama.cpp (build 8709 / commit 85d482e6b), runs llama-bench with specified parameters, and relies on 2.5GbE LAN performance as the primary distribution constraint.
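To see why the 2.5GbE link is the constraint that matters, a back-of-envelope estimate helps. The hidden size, activation precision, and LAN round-trip time below are assumptions for illustration, not figures from the benchmark.

```python
# Rough model of the per-token network cost when a model is split across
# two machines over 2.5GbE. All sizes here are illustrative assumptions.

LINK_GBPS = 2.5                          # 2.5GbE line rate
LINK_BYTES_PER_S = LINK_GBPS * 1e9 / 8   # ~312.5 MB/s

HIDDEN_DIM = 8192        # hypothetical hidden size at the split boundary
BYTES_PER_ELEM = 2       # fp16 activations

# During token-by-token generation, roughly one hidden-state vector
# crosses the link per generated token per split boundary.
payload_bytes = HIDDEN_DIM * BYTES_PER_ELEM
transfer_s = payload_bytes / LINK_BYTES_PER_S
rtt_s = 0.2e-3           # assumed LAN round-trip latency

per_token_overhead_s = transfer_s + rtt_s
print(f"payload per token: {payload_bytes / 1024:.1f} KiB")
print(f"network overhead per token: {per_token_overhead_s * 1e3:.3f} ms")
print(f"share of a 96 t/s token budget: {per_token_overhead_s * 96:.1%}")
```

Under these assumptions the wire cost per generated token is a small slice of the ~10 ms budget at 96 t/s, which suggests the visible slowdown comes less from raw bandwidth during generation and more from latency-sensitive steps such as batched prompt processing and MoE expert routing, where far more data crosses the link per step.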



