3090 NVLink testing w/ Q3.5 27B

Reddit r/LocalLLaMA / 3/11/2026


Key Points

The author tested NVLink on 2x RTX 3090 GPUs running Qwen3.5 27B in FP8 and measured faster generation and prefill than in any configuration without NVLink.
NVLink gives the two GPUs a direct link to each other, so inter-GPU traffic no longer has to cross PCIe; in the NVLink run the two cards were attached to different CPUs.
Without NVLink, putting both GPUs behind the same PLX chip was the slowest configuration, while the same-CPU (different PLX chip) and different-CPU configurations performed almost identically.
Concurrent throughput (n=20) rose from roughly 493-542 tok/s without NVLink to about 693 tok/s with NVLink, and prefill rose from roughly 1,395-1,677 tok/s to 2,181 tok/s.
The results suggest that the inter-GPU interconnect matters for multi-GPU inference, even for token generation.

Was playing around with NVLink and was somewhat surprised it made a meaningful difference, even for generation speeds.

If you are confused about why the same PLX chip is the slowest: with stock drivers, consumer GPUs can't communicate directly with each other over PCIe, so both cards are fighting over the same x16 link back to the CPU (effectively an x8 PCIe link each).
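One way to see which of these layouts a given machine has is the matrix printed by `nvidia-smi topo -m`. The sketch below is only a lookup table for the legend codes that command prints. How those codes line up with the labels used in this post ("same PLX chip", "different CPUs") is an assumption, and `describe_link` is a hypothetical helper name.

```python
# Legend codes printed by `nvidia-smi topo -m`, roughly from closest to farthest.
# Assumption: "same PLX chip" in this post corresponds to PIX and
# "different CPUs" corresponds to SYS; the post does not say so explicitly.
TOPO_CODES = {
    "PIX": "at most one PCIe bridge, e.g. both GPUs under the same PLX switch",
    "PXB": "multiple PCIe bridges, without crossing a PCIe host bridge",
    "PHB": "crosses a PCIe host bridge (typically the CPU)",
    "NODE": "crosses PCIe host bridges within one NUMA node",
    "SYS": "crosses the inter-socket link between NUMA nodes (different CPUs)",
}


def describe_link(code: str) -> str:
    """Return a short description for one cell of the topology matrix."""
    code = code.strip().upper()
    # NVLink cells look like NV1, NV2, NV4: "NV" plus the number of bonded links.
    if code.startswith("NV") and code[2:].isdigit():
        return f"NVLink, {code[2:]} bonded links"
    if code in TOPO_CODES:
        return TOPO_CODES[code]
    raise ValueError(f"unrecognized topology code: {code!r}")


if __name__ == "__main__":
    for cell in ("NV4", "PIX", "PHB", "SYS"):
        print(f"{cell:>4}: {describe_link(cell)}")
```

If the cell for the GPU pair starts with `NV`, the bridge is detected; any of the other codes means traffic between the cards goes over PCIe.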

2x 3090 - Qwen3.5 27b fp8 - [NVLink installed - different CPUs]:
--- Single Generation (mtp 2) ---
Tokens : 1024
Time : 12.90s
Speed : 79.4 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 29.54s
Throughput : 693.2 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15381 tokens (from server)
TTFT : 7053 ms (total 7073ms - ~20ms gen)
Prefill: 2,181 tok/s
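For reference, the reported rates in the block above can be reproduced from the raw counts and timings. This is a minimal sketch that assumes each rate is simply tokens divided by elapsed seconds; the post does not show the benchmark script, so the formula and the helper name `tokens_per_second` are assumptions.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Plain rate: tokens divided by elapsed seconds."""
    return tokens / seconds


# Numbers copied from the NVLink run above.
single = tokens_per_second(1024, 12.90)       # single generation
concurrent = tokens_per_second(20480, 29.54)  # 20 requests x 1024 tokens each
ttft_ms = 7073 - 20                           # total time minus ~20 ms of generation
prefill = tokens_per_second(15381, ttft_ms / 1000)

print(f"single     : {single:.1f} tok/s")
print(f"concurrent : {concurrent:.1f} tok/s")
print(f"prefill    : {prefill:.0f} tok/s")
```

The single-generation and prefill figures match the log. The aggregate figure lands within 0.1 tok/s of the logged 693.2, which is consistent with the wall time having been rounded to two decimals before printing.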

2x 3090 - Qwen3.5 27b fp8 - [No NVLink - Different PLX Chip, Same CPU]:
--- Single Generation ---
Tokens : 1024
Time : 13.78s
Speed : 74.3 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 37.80s
Throughput : 541.8 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15368 tokens (from server)
TTFT : 9165 ms (total 9186ms - ~21ms gen)
Prefill: 1,677 tok/s

2x 3090 - Qwen3.5 27b fp8 - [No NVLink - Different CPUs]:
--- Single Generation ---
Tokens : 1024
Time : 13.95s
Speed : 73.4 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 37.86s
Throughput : 541.0 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15442 tokens (from server)
TTFT : 9219 ms (total 9240ms - ~21ms gen)
Prefill: 1,675 tok/s

2x 3090 - Qwen3.5 27b fp8 - [No NVLink - Same PLX Chip]:
--- Single Generation (mtp 2) ---
Tokens : 1024
Time : 14.58s
Speed : 70.2 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 41.56s
Throughput : 492.8 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15287 tokens (from server)
TTFT : 10955 ms (total 10977ms - ~22ms gen)
Prefill: 1,395 tok/s
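
To put the four runs side by side, the sketch below computes each configuration's gain over the slowest one (no NVLink, same PLX chip). The figures are copied from the logs above. The dictionary layout and the `gain_pct` helper are illustrative choices, not part of the original benchmark.

```python
# (single tok/s, concurrent tok/s, prefill tok/s), copied from the logs above.
results = {
    "NVLink, different CPUs": (79.4, 693.2, 2181),
    "No NVLink, different PLX chip, same CPU": (74.3, 541.8, 1677),
    "No NVLink, different CPUs": (73.4, 541.0, 1675),
    "No NVLink, same PLX chip": (70.2, 492.8, 1395),
}


def gain_pct(value: float, base: float) -> float:
    """Percentage improvement of value over base."""
    return (value / base - 1.0) * 100.0


baseline = results["No NVLink, same PLX chip"]

print(f"{'configuration':<42}{'single':>9}{'concur.':>9}{'prefill':>9}")
for label, metrics in results.items():
    gains = [gain_pct(v, b) for v, b in zip(metrics, baseline)]
    print(f"{label:<42}" + "".join(f"{g:>+8.1f}%" for g in gains))
```

With these numbers, NVLink comes out roughly 41% ahead of the same-PLX setup on concurrent throughput and roughly 28% ahead of the other two non-NVLink setups, while single-stream generation gains are smaller (about 13% over same-PLX).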