Dual 3090 setup - performance optimization

Reddit r/LocalLLaMA / 4/11/2026


Key Points

  • A user reports performance results running large Qwen models on a dual-RTX 3090 setup, noting strong prompt processing (pp/s) but comparatively poor token generation (tg/s) when using split-mode on PCIe lanes with uneven bandwidth (PCIe4 x16 vs PCIe3 x4).
  • Experiments comparing backends (ik_llama.cpp vs llama.cpp, and vLLM) show that vLLM can yield higher tg/s for a 27B model, but may significantly reduce throughput and introduce long startup times depending on quantization and configuration.
  • The user theorizes that improving GPU-to-GPU transfer speeds—e.g., via an x570 motherboard with PCIe4 configured at 8x/8x—could improve tg/s for split row/graph modes, but is hesitant due to the complexity of swapping hardware in a water-cooled loop.
  • They include detailed per-model benchmarking and quantization settings (including custom Q8_K_L variants with selective BF16 tensor overrides) and credit tools/work like kld-sweep for quant comparison and tuning.
  • The post asks for community benchmarks from others using dual 3090 setups, especially those with better PCIe configurations or transfer paths.

I have this machine right now:

  • MSI B550-A PRO
  • Ryzen 5 5600X, 4x16GB DDR4 3200 MHz
  • RTX 3090 - PCIe4 x16 (~25GB/s)
  • RTX 3090 - PCIe3 x4 (<3 GB/s)
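The per-slot bandwidth figures above can be sanity-checked without opening the case. A rough sketch, assuming a Linux box with the NVIDIA driver installed:

```shell
# Show how the two GPUs are wired to each other (PIX/PXB/PHB in the
# matrix indicate whether traffic crosses a switch or the chipset):
nvidia-smi topo -m

# Report the PCIe generation and lane width the kernel actually
# negotiated for each GPU (catches a card silently running at x4):
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current \
           --format=csv
```

The negotiated values can differ from what the slot is physically capable of, so this is worth checking before blaming the motherboard layout.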

I added the second GPU just recently and after a day of optimizing stuff settled on this setup:

| Model name        | Model quant     | KV quant | --ctx-size | pp/s | tg/s | Engine       |
|-------------------|-----------------|----------|------------|------|------|--------------|
| Qwen3.5-122B-A10B | AesSedai Q4_K_M | q8_0     | 80000      | 1000 | 22   | ik_llama.cpp |
| Qwen3.5-27B       | PaMRxR Q8_K_L   | bf16     | 200000     | 1950 | 25   | llama.cpp    |
| Qwen3.5-35B-A3B   | PaMRxR Q8_K_L   | bf16     | 260000     | 4366 | 102  | llama.cpp    |

With --split-mode layer things work well, especially pp, but tg is not so ideal. With vLLM I got 50-60 tg/s on the 27B, but with a worse quant, much worse pp (~600 pp/s), and an abysmal startup time. Overall not really worth it.
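For anyone wanting to compare split modes on their own dual-GPU box, something like the following llama-bench loop is the shape of it. A sketch only, assuming a reasonably current llama.cpp build; the model path is a placeholder and flag spellings can differ between versions:

```shell
# Benchmark pp and tg under both split modes across the two 3090s.
# -sm layer: whole layers per GPU (one activation handoff per token);
# -sm row:   each weight matrix split by rows (cross-GPU traffic every layer).
for mode in layer row; do
  ./llama-bench -m qwen-27b.gguf \
                -ngl 99 -sm "$mode" -ts 1,1 \
                -p 2048 -n 128 -fa 1
done
```

Running both modes back to back on the same model makes the PCIe-path effect visible directly in the tg/s column.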

I wonder what others with dual 3090s get with these or similar models, especially if you have better transfer speeds between the GPUs. I suspect an X570 motherboard with PCIe4 x8/x8 could improve tg, especially with --split-mode row / graph. I just don't want to replace the board blindly, because everything is plumbed into a water-cooling loop that took a lot of time to set up. NVLink is unfortunately not possible, since the GPUs are different brands.
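To put a rough number on the transfer-speed theory: with layer split, only about one hidden-state vector crosses the link per generated token, so even the x4 slot shouldn't be bandwidth-bound; what hurts is per-transfer latency times the number of crossings, which row/graph split multiplies by the layer count. A back-of-envelope sketch, assuming a 5120-wide hidden state for the 27B (illustrative number, not from the model card) and fp16 activations:

```shell
# Bytes crossing the GPU-to-GPU link per token under --split-mode layer:
hidden=5120                      # assumed hidden size (illustrative)
bytes_per_tok=$((hidden * 2))    # fp16 = 2 bytes per element
echo "bytes/token: $bytes_per_tok"

# Pure transfer time over the ~3 GB/s PCIe3 x4 link, in microseconds:
awk -v b="$bytes_per_tok" 'BEGIN { printf "transfer: %.1f us\n", b / 3e9 * 1e6 }'
```

That comes out to a few microseconds per token, noise next to a ~10 ms token at 102 tg/s, which is why layer split tolerates the x4 slot; row split turns one crossing per token into one or more per layer, where synchronization latency rather than raw bandwidth dominates.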

Side note: the Q8_K_L quants are my own, basically Q8_0 with a few tensors selectively overridden to BF16. They're still smaller than UD-Q8_K_XL while achieving better KLD. Credits to /u/TitwitMuffbiscuit and his kld-sweep tool, which makes it easy to compare the ppl/KLD of multiple quants.
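For anyone wanting to try the same idea: recent llama-quantize builds accept per-tensor type overrides by name pattern. A hedged sketch only — the tensor patterns below are illustrative, not my actual recipe, and the flag may differ on older builds:

```shell
# Quantize a BF16 GGUF to Q8_0, but keep a few sensitive tensor groups
# (patterns are illustrative) in BF16:
./llama-quantize \
    --tensor-type 'attn_v=bf16' \
    --tensor-type 'ffn_down=bf16' \
    model-bf16.gguf model-q8_k_l.gguf Q8_0
```

Which tensors are worth keeping in BF16 is exactly what a ppl/KLD sweep over candidate quants is for.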

submitted by /u/PaMRxR