Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4

Reddit r/LocalLLaMA / 4/11/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The article reports that Intel Arc Pro B70 32GB can run Qwen3.5-27B quantized to Q4 successfully with vLLM, after troubleshooting multiple nights to get the Intel vLLM setup working.
  • Measured throughput shows both llama.cpp and llm-scaler-vLLM reach roughly ~12 tokens/second under the tested conditions, while scaling strategy strongly affects results (tensor parallel can degrade performance).
  • Using pipeline parallel improves token generation performance for single-query scenarios, while higher concurrency significantly boosts throughput, reaching about 135 tps at 32 concurrent requests (around 20% lower than an RTX PRO 4500 32GB).
  • The author observes power consumption at 32 concurrency is about 50% higher than the RTX PRO 4500 32GB, consistent with specifications, and notes power behavior differs between pipeline parallel (maxing out during PP step) versus single-query periods.
  • To get Qwen3.5 working, the author says the latest beta fork is required; they had difficulty with Ubuntu 24.04.4 but report success on Ubuntu 26.04 (pre-release) without any special driver installation.

Posted something when I initially got the GPU on r/IntelArc. Did not have vllm working at the time, so no real use case numbers. After many nights fighting with vllm, I finally got it to work.

Here is a summary:

  1. Both llama.cpp and llm-scaler-vllm produce a ~12 tps token generation rate.
  2. Tensor parallel degrades performance on all fronts (this may have something to do with my PCIe topology).
  3. Pipeline parallel improves PP (prompt processing) but degrades TG (token generation) for a single query; at high concurrency it improves both.
  4. High-concurrency performance is a lot better: TG reaches 135 tps at 32 concurrency, about 20% less than an RTX PRO 4500 32GB.
  5. Power consumption at 32 concurrency is about 50% higher than the RTX PRO 4500 32GB, which is consistent with the spec. Power draw maxes out during the PP step and drops by almost half during single-query TG; it does not max out during the TG step even at high concurrency.
  6. You will need the latest beta fork to get Qwen3.5 working.
  7. Once you install Ubuntu 26.04 (yes, the pre-release version), no special driver installation is needed. I was not able to get Ubuntu 24.04.4 working at all, and was also not in any mood to install the officially supported Ubuntu 25.10, which will be obsolete in 3 months.

The following command will get the Intel vLLM fork running Qwen3.5 on Ubuntu 26.04:

```shell
export HF_TOKEN="---your hf token---"

docker run -it --rm \
  --name vllmb70 \
  --ipc=host \
  --shm-size=32gb \
  --device /dev/dri:/dev/dri \
  --privileged \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  -e VLLM_TARGET_DEVICE="xpu" \
  --entrypoint /bin/bash \
  intel/llm-scaler-vllm:0.14.0-b8.1 \
  -c "source /opt/intel/oneapi/setvars.sh --force && \
    python3 -m vllm.entrypoints.openai.api_server \
      --model Intel/Qwen3.5-27B-int4-AutoRound \
      --tokenizer Qwen/Qwen3.5-27B \
      --served-model-name qwen3.5-27b \
      --gpu-memory-utilization 0.92 \
      --allow-deprecated-quantization \
      --trust-remote-code \
      --port 8000 \
      --max-model-len 4096 \
      --tensor-parallel-size 1 \
      --pipeline-parallel-size 1 \
      --enforce-eager \
      --distributed-executor-backend mp"
```
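Once the container is up, the server speaks the standard OpenAI chat-completions protocol. A minimal sketch of a request, assuming the model name and port from the launch command above (`qwen3.5-27b` on `localhost:8000`); the prompt text is just an example:

```python
import json

# Request payload for the OpenAI-compatible endpoint the container exposes.
# Model name and port come from the launch command above
# (--served-model-name qwen3.5-27b, --port 8000); adjust if you changed them.
payload = {
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

# To actually send it, the server must be running; uncomment:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
print(body.decode())
```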

Below are the measured token rates.

  1. Single GPU

Concurrency: 1

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 | 1700.83 ± 7.03 | | 1196.95 ± 13.22 | 1104.11 ± 13.22 | 1196.99 ± 13.22 |
| qwen3.5-27b | tg512 | 13.43 ± 0.09 | 14.00 ± 0.00 | | | |

Concurrency: 4

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c4) | 1492.15 ± 93.77 | 802.83 ± 468.06 | | | 3155.68 ± 1403.00 | 3047.58 ± 1403.00 | 3155.71 ± 1402.98 |
| qwen3.5-27b | tg512 (c4) | 45.91 ± 0.46 | 12.03 ± 0.38 | 52.00 ± 0.00 | 13.00 ± 0.00 | | | |

Concurrency: 8

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c8) | 1554.80 ± 5.58 | 533.91 ± 466.39 | | | 5677.56 ± 2849.77 | 5580.43 ± 2849.77 | 5677.59 ± 2849.76 |
| qwen3.5-27b | tg512 (c8) | 84.37 ± 0.31 | 11.73 ± 0.72 | 112.00 ± 0.00 | 14.00 ± 0.00 | | | |

Concurrency: 32 (this basically saturates all the compute cores on the B70)

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c32) | 1503.41 ± 1.04 | 194.92 ± 302.24 | | | 20599.68 ± 11444.52 | 20509.48 ± 11444.52 | 20599.70 ± 11444.52 |
| qwen3.5-27b | tg512 (c32) | 130.90 ± 13.08 | 5.22 ± 0.91 | 288.00 ± 0.00 | 10.39 ± 1.60 | | | |

Now Dual GPUs. Tensor Parallel 2
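For the dual-GPU runs, presumably only the parallelism flags in the launch command above change; a hedged sketch (the exact flag values below are my assumption, not stated in the post):

```shell
# Tensor parallel 2 (each layer sharded across both B70s):
#   --tensor-parallel-size 2 --pipeline-parallel-size 1
# Pipeline parallel 2 (the layer stack split between the two GPUs):
#   --tensor-parallel-size 1 --pipeline-parallel-size 2
```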

Concurrency: 1

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 | 1019.80 ± 67.88 | | 1962.77 ± 135.14 | 1835.82 ± 135.14 | 1962.82 ± 135.14 |
| qwen3.5-27b | tg512 | 9.10 ± 0.45 | 11.00 ± 1.41 | | | |

Concurrency: 32

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c32) | 1057.36 ± 1.69 | 133.90 ± 206.98 | | | 29738.38 ± 16330.06 | 29597.02 ± 16330.06 | 29738.40 ± 16330.05 |
| qwen3.5-27b | tg512 (c32) | 140.30 ± 1.78 | 6.08 ± 1.14 | 320.00 ± 0.00 | 10.32 ± 0.47 | | | |

Pipeline Parallel 2

Concurrency 1

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 | 1680.59 ± 124.37 | | 1367.69 ± 105.88 | 1161.99 ± 105.88 | 1367.74 ± 105.89 |
| qwen3.5-27b | tg512 | 10.31 ± 0.01 | 12.00 ± 0.00 | | | |

Concurrency 32

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| qwen3.5-27b | pp2048 (c32) | 2750.77 ± 1.96 | 261.41 ± 294.53 | | | 11889.30 ± 5927.16 | 11768.85 ± 5927.16 | 11889.32 ± 5927.16 |
| qwen3.5-27b | tg512 (c32) | 195.82 ± 4.09 | 7.14 ± 0.57 | 293.33 ± 7.54 | 9.51 ± 0.50 | | | |
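Pulling the c32 tg512 numbers together: tensor parallel barely moves aggregate throughput, while pipeline parallel gives roughly a 50% lift over a single GPU. A quick check of the ratios, using the values reported above:

```python
# Aggregate tg512 throughput at concurrency 32, read from the tables above.
single_gpu_tps = 130.90   # one B70
tp2_tps = 140.30          # two B70s, tensor parallel 2
pp2_tps = 195.82          # two B70s, pipeline parallel 2

# Relative gain of each dual-GPU strategy over a single GPU.
tp2_gain = tp2_tps / single_gpu_tps - 1
pp2_gain = pp2_tps / single_gpu_tps - 1
print(f"TP2 gain over single GPU: {tp2_gain:+.0%}")  # → +7%
print(f"PP2 gain over single GPU: {pp2_gain:+.0%}")  # → +50%
```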
submitted by /u/Puzzleheaded_Base302