AI Navigate

Multi-GPU? Check your PCI-E lanes! x570, Doubled my prompt proc. speed by switching 'primary' devices, on an asymmetrical x16 / x4 lane setup.

Reddit r/LocalLLaMA / 3/18/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A user reports that adding export CUDA_VISIBLE_DEVICES="1,0" to the llama.cpp launch script doubled prompt processing speed for oversized MoE models on a dual RTX 3090 system with an asymmetric PCIe lane split (16x/4x) on an x570 motherboard.
  • The improvement is attributed to the CUDA0 device occupying the 4-lane slot, and swapping the device order shifts workload to the other GPU, boosting throughput in that specific setup.
  • The tip applies only to systems with asymmetric PCIe lanes or GPUs with differing lane counts; users should verify their lane configuration with nvtop or lspci before changing the GPU order.
  • The author measured a jump from about 70 t/s to 140 t/s and notes the result may be less pronounced on symmetric x8/x8 setups or non-oversized models, recommending personal testing.

Short version - in my situation, adding export CUDA_VISIBLE_DEVICES="1,0" to my llama.cpp launch script doubled prompt processing speed for me in some situations.
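As a rough illustration, the one-line change could sit in a launch script like the sketch below. The script name matches the post, but the server binary and model path are placeholders, not the OP's actual invocation.

```shell
#!/usr/bin/env bash
# run-llama.cpp.sh - sketch only; binary name and model path are hypothetical.

# Swap the CUDA device order so the GPU on the full x16 slot is enumerated
# as CUDA0, the device llama.cpp leans on hardest during prompt processing.
export CUDA_VISIBLE_DEVICES="1,0"

echo "CUDA device order: $CUDA_VISIBLE_DEVICES"

# Hypothetical invocation - substitute your own binary, model, and flags:
# ./llama-server -m /models/some-moe-model.gguf --n-gpu-layers 99
```

Note that `CUDA_VISIBLE_DEVICES` remaps enumeration order, so "1,0" makes physical GPU 1 appear to llama.cpp as device 0.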

Folks, I've been running a dual 3090 setup on a system that splits the PCI-E lanes 16x / 4x between the two "x16" slots (common on x570 boards, I believe). For whatever reason, by default, at least in my setup (Ubuntu-Server 24.04 Nvidia 580.126.20 drivers, x570 board), the CUDA0 device is the one on the 4-lane PCI express slot.

I added this line to my run-llama.cpp.sh script, and my prompt processing speed - at least for MoE models - has doubled. Don't do this unless your setup is similarly asymmetric in PCI-E lanes or GPU performance. Check your lanes using nvtop, or the more verbose lspci output, to confirm link widths.
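For the lspci route, the negotiated width shows up on the `LnkSta` line (e.g. `sudo lspci -vv -s <bus_id> | grep LnkSta` for each GPU's bus ID). A small sketch of what to look for, using a canned sample line rather than live hardware:

```shell
# Sample "LnkSta" line of the kind lspci -vv prints for a GPU on a x4 slot.
# On a real system, get it with: sudo lspci -vv -s <bus_id> | grep LnkSta
sample_lnksta="LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)"

# Extract the negotiated lane count from the "Width xN" field.
width=$(printf '%s' "$sample_lnksta" | sed -n 's/.*Width x\([0-9]*\).*/\1/p')
echo "negotiated lanes: $width"
```

A width of 4 on one card and 16 on the other is exactly the asymmetry the post describes; `nvidia-smi --query-gpu=index,pcie.link.width.current --format=csv` gives the same information per device.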

For oversized MoE models, I've jumped from PP of 70 t/s to 140 t/s, and I'm thrilled. Had to share the love.

This is irrelevant if your system does an x8/x8 split, but relevant if you have either two different lane counts or two different GPUs. It may not matter as much with something like ik_llama.cpp, which splits work between GPUs differently, or with vLLM - I haven't tested those - but at least with current stock llama.cpp, it makes a big difference for me!

I'm thrilled to see this free performance boost.

How did I discover this? I was watching nvtop recently and noticed that during prompt processing, the majority of the work was happening on GPU0 / CUDA0 - and I remembered that it's only using 4 lanes. I expected a modest change in performance, but doubling PP t/s was so unexpected that I've tested it several times to make sure I'm not nuts, comparing against older benchmarks and current benchmarks with and without the swap. Dang!

I'll try to update in a bit to note if there's as much of a difference on non-oversized models - I'll guess there's a marginal improvement in those circumstances. But, I bet I'm far from the only person here with a DDR4 x570 system and two GPUs - so I hope I can make someone else's day better!

submitted by /u/overand