Posted something when I initially got the GPU on r/IntelArc. Did not have vllm working at the time, so no real use case numbers. After many nights fighting with vllm, I finally got it to work.
Here is a summary:
- Both llama.cpp and llm-scaler-vllm produce a token generation rate of ~12 t/s.
- Tensor parallel degrades performance on all fronts (this may have something to do with my PCIe topology).
- Pipeline parallel improves prompt processing (PP) but degrades token generation (TG) for a single query; at high concurrency it improves both.
- High-concurrency performance is a lot better: TG reaches 135 t/s at 32 concurrent requests, which is about 20% less than an RTX PRO 4500 32GB.
- Power consumption at 32 concurrency is about 50% higher than the RTX PRO 4500 32GB, which is consistent with the specs. Power draw maxes out during the PP step and drops by almost half during single-query TG; it does not max out during the TG step even at high concurrency.
- You will need the latest beta fork to get Qwen3.5 working.
- Once you install Ubuntu 26.04 (yes, the pre-release version), no special driver installation is needed. I was not able to get Ubuntu 24.04.4 working at all, and was not in any mood to install the officially supported Ubuntu 25.10, which will be obsolete in 3 months.
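The batching gains in the summary can be sanity-checked with a little arithmetic (a sketch using only the numbers above: ~12 t/s for a single query, ~135 t/s aggregate at 32 concurrent requests):

```python
# Back-of-envelope batching math from the summary numbers above.
single_tg = 12.0        # t/s, one request
aggregate_tg = 135.0    # t/s, 32 concurrent requests
concurrency = 32

throughput_gain = aggregate_tg / single_tg        # ~11.3x more tokens/s overall
per_request_tg = aggregate_tg / concurrency       # ~4.2 t/s seen by each request
batching_efficiency = per_request_tg / single_tg  # each request runs at ~35% of solo speed

print(f"{throughput_gain:.1f}x aggregate, {per_request_tg:.1f} t/s per request, "
      f"{batching_efficiency:.0%} of single-query speed")
```

So batching buys roughly 11x total throughput at the cost of each individual request running about 3x slower, which is the usual memory-bandwidth-bound LLM serving trade-off.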
The commands below will get the Intel vLLM fork running Qwen3.5 on Ubuntu 26.04 LTS:
    export HF_TOKEN="---your hf token---"

    docker run -it --rm \
      --name vllmb70 \
      --ipc=host \
      --shm-size=32gb \
      --device /dev/dri:/dev/dri \
      --privileged \
      -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      -e HF_TOKEN=$HF_TOKEN \
      -e VLLM_TARGET_DEVICE="xpu" \
      --entrypoint /bin/bash \
      intel/llm-scaler-vllm:0.14.0-b8.1 \
      -c "source /opt/intel/oneapi/setvars.sh --force && \
          python3 -m vllm.entrypoints.openai.api_server \
            --model Intel/Qwen3.5-27B-int4-AutoRound \
            --tokenizer Qwen/Qwen3.5-27B \
            --served-model-name qwen3.5-27b \
            --gpu-memory-utilization 0.92 \
            --allow-deprecated-quantization \
            --trust-remote-code \
            --port 8000 \
            --max-model-len 4096 \
            --tensor-parallel-size 1 \
            --pipeline-parallel-size 1 \
            --enforce-eager \
            --distributed-executor-backend mp"

Below are the measured token rates.
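A quick aside before the numbers: the server speaks the standard OpenAI-compatible HTTP API on port 8000, so any OpenAI client works. A minimal stdlib-only sketch (the model name matches `--served-model-name` above; the payload is the stock `/v1/completions` shape):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/completions"

def build_completion_request(prompt, model="qwen3.5-27b", max_tokens=256):
    """JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": False,
    }

def complete(prompt):
    """POST the prompt to the local vLLM server and return the generated text."""
    body = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Usage, with the container running:
#   print(complete("Say hello in one word."))
```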
- Single GPU
Concurrency: 1
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 | 1700.83 ± 7.03 | | 1196.95 ± 13.22 | 1104.11 ± 13.22 | 1196.99 ± 13.22 |
| qwen3.5-27b | tg512 | 13.43 ± 0.09 | 14.00 ± 0.00 | | | |
Concurrency: 4
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c4) | 1492.15 ± 93.77 | 802.83 ± 468.06 | | | 3155.68 ± 1403.00 | 3047.58 ± 1403.00 | 3155.71 ± 1402.98 |
| qwen3.5-27b | tg512 (c4) | 45.91 ± 0.46 | 12.03 ± 0.38 | 52.00 ± 0.00 | 13.00 ± 0.00 | | | |
Concurrency: 8
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c8) | 1554.80 ± 5.58 | 533.91 ± 466.39 | | | 5677.56 ± 2849.77 | 5580.43 ± 2849.77 | 5677.59 ± 2849.76 |
| qwen3.5-27b | tg512 (c8) | 84.37 ± 0.31 | 11.73 ± 0.72 | 112.00 ± 0.00 | 14.00 ± 0.00 | | | |
Concurrency: 32 (this basically saturates all the compute cores on the B70)
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c32) | 1503.41 ± 1.04 | 194.92 ± 302.24 | | | 20599.68 ± 11444.52 | 20509.48 ± 11444.52 | 20599.70 ± 11444.52 |
| qwen3.5-27b | tg512 (c32) | 130.90 ± 13.08 | 5.22 ± 0.91 | 288.00 ± 0.00 | 10.39 ± 1.60 | | | |
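The growth of TTFT with concurrency is roughly what you would expect if prompts are prefilled one after another at the ~1500 t/s PP rate. A back-of-envelope sketch (rates taken from the tables above; the serial-prefill assumption is mine, not something the benchmark reports):

```python
# Estimate mean time-to-first-token when `concurrency` prompts of
# `prompt_tokens` each queue behind a prefill rate of `pp_rate` t/s.
# If prefills run back to back, request i waits for i+1 prefill slots,
# so the mean wait across requests is (concurrency + 1) / 2 slots.

def mean_ttft_s(pp_rate, prompt_tokens, concurrency):
    prefill_s = prompt_tokens / pp_rate
    return (concurrency + 1) / 2 * prefill_s

# Single GPU, pp2048:
est_c32 = mean_ttft_s(1503.41, 2048, 32)  # ~22.5 s vs ~20.6 s measured
est_c8 = mean_ttft_s(1554.80, 2048, 8)    # ~5.9 s vs ~5.7 s measured
print(f"c32: {est_c32:.1f} s, c8: {est_c8:.1f} s")
```

Both estimates land within about 10% of the measured means, so the huge c32 TTFT is just queueing, not a pathology.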
Now Dual GPUs. Tensor Parallel 2
Concurrency: 1
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 | 1019.80 ± 67.88 | | 1962.77 ± 135.14 | 1835.82 ± 135.14 | 1962.82 ± 135.14 |
| qwen3.5-27b | tg512 | 9.10 ± 0.45 | 11.00 ± 1.41 | | | |
Concurrency: 32
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c32) | 1057.36 ± 1.69 | 133.90 ± 206.98 | | | 29738.38 ± 16330.06 | 29597.02 ± 16330.06 | 29738.40 ± 16330.05 |
| qwen3.5-27b | tg512 (c32) | 140.30 ± 1.78 | 6.08 ± 1.14 | 320.00 ± 0.00 | 10.32 ± 0.47 | | | |
Pipeline Parallel 2
Concurrency: 1
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 | 1680.59 ± 124.37 | | 1367.69 ± 105.88 | 1161.99 ± 105.88 | 1367.74 ± 105.89 |
| qwen3.5-27b | tg512 | 10.31 ± 0.01 | 12.00 ± 0.00 | | | |
Concurrency: 32
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c32) | 2750.77 ± 1.96 | 261.41 ± 294.53 | | | 11889.30 ± 5927.16 | 11768.85 ± 5927.16 | 11889.32 ± 5927.16 |
| qwen3.5-27b | tg512 (c32) | 195.82 ± 4.09 | 7.14 ± 0.57 | 293.33 ± 7.54 | 9.51 ± 0.50 | | | |
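Comparing the c32 totals across the three configurations makes the summary bullets concrete (a quick calculation on the measured means from the tables above):

```python
# c32 total throughput (t/s, mean) from the tables above.
configs = {
    "single": {"pp": 1503.41, "tg": 130.90},
    "tp2":    {"pp": 1057.36, "tg": 140.30},
    "pp2":    {"pp": 2750.77, "tg": 195.82},
}

base = configs["single"]
for name in ("tp2", "pp2"):
    c = configs[name]
    print(f"{name}: PP {c['pp'] / base['pp']:.2f}x, TG {c['tg'] / base['tg']:.2f}x")
# Tensor parallel actually loses prompt throughput (~0.70x) and barely helps
# generation (~1.07x); pipeline parallel gains on both (~1.83x PP, ~1.50x TG),
# consistent with the summary bullets.
```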