I frequently see (both here and on r/LocalLLM) comments claiming that multi-GPU setups are complex, problematic, and typically bottlenecked by PCIe bandwidth on consumer motherboards.
I am running 2x RTX 5060 Ti 16GB (and about to add a third), and my PCIe setup is pretty bad: GPU0 is in a full x16 Gen 5 slot (running at x8, which is as fast as a 5060 Ti can go), while GPU1 is stuck on PCIe 4.0 x4 via the chipset.
I created (with AI help) a little benchmark script that runs a prefill benchmark against vLLM (running with TP=2) while monitoring PCIe bandwidth consumption.
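My script isn't included here, but a minimal sketch of a similar monitor is below. It polls `nvidia-smi dmon -s t` (which reports per-GPU PCIe RX/TX throughput in MB/s) and tracks the peak combined throughput per GPU. The column order (gpu index, rxpci, txpci) is an assumption based on typical `dmon` output; check the header line on your driver version.

```python
import subprocess

def parse_dmon_line(line):
    """Parse one data line of `nvidia-smi dmon -s t` output.

    Assumed columns: GPU index, PCIe RX (MB/s), PCIe TX (MB/s).
    Returns (gpu_index, rx_mb_s, tx_mb_s), or None for header/comment lines.
    """
    if line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 3:
        return None
    return int(fields[0]), float(fields[1]), float(fields[2])

def monitor_peak_pcie(samples=30):
    """Collect `samples` dmon readings and return peak GB/s seen per GPU."""
    proc = subprocess.Popen(
        ["nvidia-smi", "dmon", "-s", "t", "-c", str(samples)],
        stdout=subprocess.PIPE, text=True,
    )
    peak = {}
    for line in proc.stdout:
        parsed = parse_dmon_line(line)
        if parsed:
            gpu, rx, tx = parsed
            # Combine both directions and convert MB/s -> GB/s.
            peak[gpu] = max(peak.get(gpu, 0.0), (rx + tx) / 1024)
    proc.wait()
    return peak
```

Run the prefill benchmark in one terminal and `monitor_peak_pcie()` in another to catch the peak during prompt processing.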
I ran with a 32k context (low enough to leave room for higher quants in the benchmark, but enough to saturate prompt processing).
Peak bandwidth consumed was 3-4 GB/s during prefill, which is only ~40-50% of even the weak 4.0 x4 link. The "faster" the quant, the higher the bandwidth (which I take to mean the 5060s are VRAM-bandwidth or compute limited, not PCIe limited).
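As a sanity check on that utilization figure, the theoretical per-direction ceiling of a PCIe link is easy to compute from nominal per-lane figures (after 128b/130b encoding overhead for Gen3 and up):

```python
# Approximate usable per-direction bandwidth in GB/s for one PCIe lane,
# after 128b/130b encoding overhead (Gen3+). Nominal figures.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def pcie_bandwidth(gen, lanes):
    """Theoretical per-direction bandwidth (GB/s) of a PCIe link."""
    return GBPS_PER_LANE[gen] * lanes

# Gen4 x4 (the chipset link to GPU1) gives ~7.9 GB/s per direction,
# so a 3-4 GB/s peak is roughly 40-50% utilization of that link.
```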
Some prefill rates (TP=2):
QuantTrio/gemma-4-31B-it-AWQ-6Bit: ~840-850 t/s
LilaRest/gemma-4-31B-it-NVFP4-turbo: ~1500 t/s
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP: ~1600-1700 t/s
It seems realistic that I can safely add a third 5060 Ti (via an NVMe-to-PCIe adapter in a CPU-connected M.2 slot, giving PCIe 5.0 x4) without getting bottlenecked on PCIe bandwidth. Adding a fourth is probably out with this motherboard, though, as that would require using more of the chipset lanes, which are already the limiting factor.
I posted this mostly as an FYI, but also as a question: am I missing something obvious here? :)