AI Navigate

Qwen3.5 27B and 35B with 2x AMD 7900 XTX vLLM bench serve results

Reddit r/LocalLLaMA / 3/21/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The article presents vLLM bench-serve results for Qwen3.5-27B-AWQ-BF16-INT4 and Qwen3.5-35B-AWQ-4bit running on a 2x AMD 7900 XTX setup using a rocm/vllm-dev nightly container.
  • The author notes that setting HSA_ENABLE_IPC_MODE_LEGACY=0 was the key to multi-GPU performance; without it, NCCL_P2P_DISABLE=1 (disabling P2P) was needed just to get vLLM to serve the model at all.
  • In the 50-request benchmark with up to 30 concurrent requests, there were 50 successful requests, 0 failures, a 46.91-second run, and a total token throughput of about 500 tok/s (mean output throughput 226.45 tok/s, peaking at 418.00 tok/s).
  • A separate server session reported high prompt throughput (roughly 1,436–2,010 tokens/s during prefill), with generation throughput climbing to about 306 tokens/s once prefill finished, alongside varying GPU KV cache usage and a 0% prefix cache hit rate, illustrating performance variability over the course of a run.
  • The author suggests that prefix caching could improve performance once implemented, referencing an open GitHub issue.

I've enjoyed the recent reports of success with Qwen3.5 using vLLM on multiple AMD GPUs, especially given AMD's dwindling market share these days! Here are some 'bench serve' results from 2x 7900 XTX with the smaller Qwen3.5 models, cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 and cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit.

This was done with a fairly recent rocm/vllm-dev:nightly container: 0.17.2rc1.dev43+ge6c479770

kernel version: 6.19.8-cachyos-lto (maybe relevant)
kernel cmdline: ttm.pages_limit=30720000 iommu=pt amdgpu.ppfeaturemask=0xfffd7fff

The key to getting this working at speed was the poorly documented legacy env var HSA_ENABLE_IPC_MODE_LEGACY=0. Otherwise, it was necessary to disable NCCL P2P via NCCL_P2P_DISABLE=1 just to get vLLM to serve the model at all. But what's the point of multi-GPU without some P2P! This will look even more reasonable if/when prefix caching gets implemented (https://github.com/vllm-project/vllm/issues/36493).
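For context, a launch along these lines is what the env var would plug into. This is a hypothetical sketch, not the author's actual command: the device flags, port, and --tensor-parallel-size 2 are assumptions (the post only gives the container image, the env var, and the model names).

```shell
# Hypothetical launch sketch. Only HSA_ENABLE_IPC_MODE_LEGACY=0, the
# rocm/vllm-dev:nightly image, and the model name come from the post;
# everything else is a plausible guess for a 2-GPU ROCm setup.
docker run -it --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  -e HSA_ENABLE_IPC_MODE_LEGACY=0 \
  -p 8000:8000 \
  rocm/vllm-dev:nightly \
  vllm serve cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 \
    --tensor-parallel-size 2
```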

On to the numbers. The TTFTs are pretty poor; this was just a quick stab at smashing vLLM with traffic to see how it would go.

vllm bench serve --backend vllm --model cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 --endpoint /v1/completions --dataset-name sharegpt --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 50 --max-concurrency 30 --request-rate inf

============ Serving Benchmark Result ============
Successful requests:                     50
Failed requests:                         0
Maximum request concurrency:             30
Benchmark duration (s):                  46.91
Total input tokens:                      12852
Total generated tokens:                  10623
Request throughput (req/s):              1.07
Output token throughput (tok/s):         226.45
Peak output token throughput (tok/s):    418.00
Peak concurrent requests:                33.00
Total token throughput (tok/s):          500.41
---------------Time to First Token----------------
Mean TTFT (ms):                          1626.60
Median TTFT (ms):                        1951.13
P99 TTFT (ms):                           3432.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          96.87
Median TPOT (ms):                        87.50
P99 TPOT (ms):                           253.70
---------------Inter-token Latency----------------
Mean ITL (ms):                           73.63
Median ITL (ms):                         68.60
P99 ITL (ms):                            410.73
==================================================
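As a sanity check, the derived rates in the table follow directly from the raw counts. A minimal sketch recomputing them (values copied from the result above; small differences come from rounding of the duration):

```python
# Recompute the derived throughput figures from the raw counts reported above.
duration_s = 46.91
num_requests = 50
input_tokens = 12852
output_tokens = 10623

req_per_s = num_requests / duration_s                            # reported: 1.07
output_tok_per_s = output_tokens / duration_s                    # reported: 226.45
total_tok_per_s = (input_tokens + output_tokens) / duration_s    # reported: 500.41

print(f"{req_per_s:.2f} req/s, {output_tok_per_s:.2f} out tok/s, "
      f"{total_tok_per_s:.2f} total tok/s")
```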

...some server logs from another session that had impressive throughput. (Not this above session)

(APIServer pid=1) INFO 03-20 20:19:44 [loggers.py:259] Engine 000: Avg prompt throughput: 1436.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 7 reqs, Waiting: 13 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 20:19:54 [loggers.py:259] Engine 000: Avg prompt throughput: 2010.5 tokens/s, Avg generation throughput: 8.1 tokens/s, Running: 14 reqs, Waiting: 6 reqs, GPU KV cache usage: 34.9%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 20:20:04 [loggers.py:259] Engine 000: Avg prompt throughput: 1723.1 tokens/s, Avg generation throughput: 13.9 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 50.7%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 20:20:14 [loggers.py:259] Engine 000: Avg prompt throughput: 574.4 tokens/s, Avg generation throughput: 271.9 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 51.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 20:20:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 306.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 58.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 20:20:34 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 304.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 58.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 20:20:44 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 117.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

vllm bench serve --backend vllm --model cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit --endpoint /v1/completions --dataset-name sharegpt --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200 --max-concurrency 50 --request-rate inf

============ Serving Benchmark Result ============
Successful requests:                     200
Failed requests:                         0
Maximum request concurrency:             50
Benchmark duration (s):                  83.30
Total input tokens:                      45055
Total generated tokens:                  45249
Request throughput (req/s):              2.40
Output token throughput (tok/s):         543.20
Peak output token throughput (tok/s):    797.00
Peak concurrent requests:                56.00
Total token throughput (tok/s):          1084.08
---------------Time to First Token----------------
Mean TTFT (ms):                          536.74
Median TTFT (ms):                        380.60
P99 TTFT (ms):                           1730.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.70
Median TPOT (ms):                        77.60
P99 TPOT (ms):                           165.30
---------------Inter-token Latency----------------
Mean ITL (ms):                           73.62
Median ITL (ms):                         63.28
P99 ITL (ms):                            172.72
==================================================

...the corresponding server log for the above run

(APIServer pid=1) INFO 03-20 21:01:07 [loggers.py:259] Engine 000: Avg prompt throughput: 1936.5 tokens/s, Avg generation throughput: 378.0 tokens/s, Running: 49 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:01:17 [loggers.py:259] Engine 000: Avg prompt throughput: 476.3 tokens/s, Avg generation throughput: 627.3 tokens/s, Running: 49 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:01:27 [loggers.py:259] Engine 000: Avg prompt throughput: 667.6 tokens/s, Avg generation throughput: 611.5 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 24.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:01:37 [loggers.py:259] Engine 000: Avg prompt throughput: 331.2 tokens/s, Avg generation throughput: 685.0 tokens/s, Running: 48 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:01:47 [loggers.py:259] Engine 000: Avg prompt throughput: 466.7 tokens/s, Avg generation throughput: 633.2 tokens/s, Running: 49 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.9%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:01:57 [loggers.py:259] Engine 000: Avg prompt throughput: 627.1 tokens/s, Avg generation throughput: 614.8 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 19.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:02:07 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 518.2 tokens/s, Running: 26 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:02:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 366.8 tokens/s, Running: 13 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:02:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 90.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-20 21:02:37 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
submitted by /u/bettertoknow