Was playing around with NVLink and was somewhat surprised it made a meaningful difference, even for generation speeds.
If you're wondering why the same-PLX-chip setup is the slowest: with stock drivers, consumer GPUs can't communicate directly with each other over PCIe, so both cards end up fighting over the same x16 link back to the CPU (effectively an x8 PCIe link each).
2x 3090 - Qwen3.5 27b fp8 - [NVLink installed - different CPUs]:
--- Single Generation (mtp 2) ---
Tokens : 1024
Time : 12.90s
Speed : 79.4 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 29.54s
Throughput : 693.2 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15381 tokens (from server)
TTFT : 7053 ms (total 7073ms - ~20ms gen)
Prefill: 2,181 tok/s
2x 3090 - Qwen3.5 27b fp8 - [No NVLink - Different PLX Chip, Same CPU]:
--- Single Generation ---
Tokens : 1024
Time : 13.78s
Speed : 74.3 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 37.80s
Throughput : 541.8 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15368 tokens (from server)
TTFT : 9165 ms (total 9186ms - ~21ms gen)
Prefill: 1,677 tok/s
2x 3090 - Qwen3.5 27b fp8 - [No NVLink - Different CPUs]:
--- Single Generation ---
Tokens : 1024
Time : 13.95s
Speed : 73.4 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 37.86s
Throughput : 541.0 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15442 tokens (from server)
TTFT : 9219 ms (total 9240ms - ~21ms gen)
Prefill: 1,675 tok/s
2x 3090 - Qwen3.5 27b fp8 - [No NVLink - Same PLX Chip]:
--- Single Generation (mtp 2) ---
Tokens : 1024
Time : 14.58s
Speed : 70.2 tok/s
--- Concurrent Generation (n=20) ---
Total tokens : 20480
Wall time : 41.56s
Throughput : 492.8 tok/s (aggregate)
--- Prefill / TTFT (target ~8000 input tokens) ---
Input : 15287 tokens (from server)
TTFT : 10955 ms (total 10977ms - ~22ms gen)
Prefill: 1,395 tok/s
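
To put the four runs side by side, here is a rough sketch that computes how much faster the NVLink run was than each non-NVLink run. The numbers are copied straight from the output above; the dictionary layout and the speedup_pct helper are just names made up for this illustration, not part of the benchmark script.

```python
# Figures copied from the runs above, in tok/s:
# (single generation, concurrent n=20 aggregate, prefill)
results = {
    "NVLink, different CPUs": (79.4, 693.2, 2181),
    "No NVLink, different PLX chip, same CPU": (74.3, 541.8, 1677),
    "No NVLink, different CPUs": (73.4, 541.0, 1675),
    "No NVLink, same PLX chip": (70.2, 492.8, 1395),
}


def speedup_pct(fast, slow):
    """Percentage by which `fast` exceeds `slow` (hypothetical helper)."""
    return (fast / slow - 1.0) * 100.0


baseline_name = "NVLink, different CPUs"
nvlink = results[baseline_name]

# For each non-NVLink run, store the NVLink advantage per metric.
gains = {}
for name, row in results.items():
    if name == baseline_name:
        continue
    gains[name] = tuple(round(speedup_pct(f, s), 1) for f, s in zip(nvlink, row))
    single, concurrent, prefill = gains[name]
    print(f"{name}: single +{single}%, concurrent +{concurrent}%, prefill +{prefill}%")
```

Going by these numbers, NVLink comes out roughly 7% ahead on single generation, 28% on concurrent throughput and 30% on prefill compared with the different-PLX run, and roughly 13%, 41% and 56% ahead of the same-PLX run. Note that the single-generation headers don't all mention mtp 2, so that column may not be a perfectly like-for-like comparison.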