I frequently see (both here and on r/LocalLLM) comments claiming that multi-GPU setups are complex, problematic, and typically bottlenecked by PCIe bandwidth on consumer motherboards.
I am running 2x RTX 5060 Ti 16GB (and about to add a third), and my PCIe setup is pretty bad: GPU0 is in a full x16 Gen 5 slot (running at x8, which is as fast as a 5060 Ti can go), while GPU1 is stuck on PCIe 4.0 x4 via the chipset.
I created (with AI help) a little benchmark script that runs a prefill benchmark against vLLM (running with TP=2) while monitoring PCIe bandwidth consumption.
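My script isn't included here, but a minimal sketch of a similar monitor is below. It polls `nvidia-smi dmon -s t` (which reports per-GPU PCIe RX/TX throughput in MB/s) and tracks the peak combined throughput per GPU. The column order (gpu index, rxpci, txpci) is an assumption based on typical `dmon` output; check the header line on your driver version.

```python
import subprocess

def parse_dmon_line(line):
    """Parse one data line of `nvidia-smi dmon -s t` output.

    Assumed columns: GPU index, PCIe RX (MB/s), PCIe TX (MB/s).
    Returns (gpu_index, rx_mb_s, tx_mb_s), or None for header/comment lines.
    """
    if line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 3:
        return None
    return int(fields[0]), float(fields[1]), float(fields[2])

def monitor_peak_pcie(samples=30):
    """Collect `samples` dmon readings and return peak GB/s seen per GPU."""
    proc = subprocess.Popen(
        ["nvidia-smi", "dmon", "-s", "t", "-c", str(samples)],
        stdout=subprocess.PIPE, text=True,
    )
    peak = {}
    for line in proc.stdout:
        parsed = parse_dmon_line(line)
        if parsed:
            gpu, rx, tx = parsed
            # Combine both directions and convert MB/s -> GB/s.
            peak[gpu] = max(peak.get(gpu, 0.0), (rx + tx) / 1024)
    proc.wait()
    return peak
```

Run the prefill benchmark in one terminal and `monitor_peak_pcie()` in another to catch the peak during prompt processing.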
I ran with a 32k context (low enough to leave room for higher quants in the benchmark, but enough to saturate prompt processing).
Peak bandwidth consumed was 3-4 GB/s during prefill, which is only ~40-50% of even the weak 4.0 x4 link. The "faster" the quant, the higher the bandwidth (which I take to mean the 5060s are VRAM-bandwidth or compute limited, not PCIe limited).
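As a sanity check on that utilization figure, the theoretical per-direction ceiling of a PCIe link is easy to compute from nominal per-lane figures (after 128b/130b encoding overhead for Gen3 and up):

```python
# Approximate usable per-direction bandwidth in GB/s for one PCIe lane,
# after 128b/130b encoding overhead (Gen3+). Nominal figures.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def pcie_bandwidth(gen, lanes):
    """Theoretical per-direction bandwidth (GB/s) of a PCIe link."""
    return GBPS_PER_LANE[gen] * lanes

# Gen4 x4 (the chipset link to GPU1) gives ~7.9 GB/s per direction,
# so a 3-4 GB/s peak is roughly 40-50% utilization of that link.
```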
Some prefill rates (TP=2):
QuantTrio/gemma-4-31B-it-AWQ-6Bit: ~840-850 t/s
LilaRest/gemma-4-31B-it-NVFP4-turbo: ~1500 t/s
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP: ~1600-1700 t/s
It seems realistic that I can safely add a third 5060 Ti (via an NVMe-to-PCIe adapter in a CPU-connected M.2 slot, giving PCIe 5.0 x4) without getting bottlenecked on PCIe bandwidth. Adding a fourth is probably out with this motherboard, though, as that would require using more of the chipset lanes, which are already the limiting factor.
I posted this mostly as an FYI, but also as a question: am I missing something obvious here? :)