First-time local LLM user here!
I’m running an old HP Z640 workstation with dual Xeon E5 v4 CPUs and around 100GB of RAM. It used to have a Titan X Pascal, but I swapped that out for an Arc B70. I’m not sure whether the motherboard supports Resizable BAR (ReBAR), but I believe it does support Above 4G Decoding. After quite a bit of fiddling with BIOS settings, I finally got the machine to boot with the B70 installed. The key was keeping the card plugged into a powered-on monitor until the GRUB screen appeared; otherwise the system wouldn’t boot and would just beep six to eight times.
For running LLMs, I’ve had good success with the Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf model under llama.cpp, which performs decently with a ~130k context window. I couldn’t get vLLM or any other runtime to work, though. Both the Vulkan and SYCL backends work with llama.cpp, but SYCL is faster for me. I’m running Ubuntu 26.04 (beta) and followed the steps in PR #22078 to get the SYCL backend compiled and running.
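For anyone else attempting the SYCL route, the build boils down to roughly the following. This is a sketch, not the exact steps from the PR: it assumes the Intel oneAPI Base Toolkit is installed in its default location, so check PR #22078 and llama.cpp's SYCL docs for the authoritative flags.

```shell
# Assumes the Intel oneAPI Base Toolkit is installed at the default path;
# adjust for your setup. Sketch only -- see PR #22078 for the exact steps.
source /opt/intel/oneapi/setvars.sh   # puts icx/icpx and the SYCL runtime on PATH

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

sycl-ls   # sanity check: the Arc card should be listed as a Level Zero GPU device
```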
Here are the configs that worked for me (though I’m still tweaking them):
```
./llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --alias "qwen-3.6-35b" \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 1024 \
  --flash-attn 1 \
  --cache-ram 8192 \
  -np 1 --host 0.0.0.0 --port 8100 \
  -ngl all \
  --ctx-size 131072 --temp 0.6 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-checkpoints 32 --swa-full --jinja
```

Here’s some performance data:
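One reason the q8_0 K/V cache settings matter at this context size: a rough back-of-envelope for KV memory. The layer/head numbers below are assumptions typical of a ~30B A3B-class config with GQA, not values read from this exact GGUF, so treat the result as illustrative only.

```python
# Rough KV-cache size estimate. The model dimensions here are ASSUMPTIONS
# (typical of a ~30B MoE with grouped-query attention), not confirmed for this model.
n_layers, n_kv_heads, head_dim = 48, 4, 128
ctx = 131072

elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # x2 for K and V

f16_gib = elems * 2 / 2**30           # default f16 cache: 2 bytes per element
q8_0_gib = elems * (34 / 32) / 2**30  # q8_0: 34 bytes per 32-element block

print(f"f16: {f16_gib:.2f} GiB, q8_0: {q8_0_gib:.2f} GiB")
# With these assumed dims at 128k context: f16 = 12.00 GiB vs q8_0 ~ 6.4 GiB
```

So quantizing the cache roughly halves KV memory at full context, which is what makes the 131072 window workable here.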
- Prompt eval time: 278,576.23 ms / 78,720 tokens (3.54 ms per token, 282.58 tokens per second)
- Eval time: 15,292.59 ms / 181 tokens (84.49 ms per token, 11.84 tokens per second)
- Total time: 293,868.82 ms / 78,901 tokens
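The tokens-per-second figures are just token counts divided by the reported times; a quick arithmetic check against the numbers above:

```python
# Sanity-check the throughput numbers from llama.cpp's timing output.
prompt_ms, prompt_tokens = 278576.23, 78720
eval_ms, eval_tokens = 15292.59, 181

prompt_tps = prompt_tokens / (prompt_ms / 1000)
eval_tps = eval_tokens / (eval_ms / 1000)

print(f"prompt: {prompt_tps:.2f} tok/s, gen: {eval_tps:.2f} tok/s")
# -> prompt: 282.58 tok/s, gen: 11.84 tok/s (matches the log)
```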
Hope this helps anyone else with a similar setup! I’m fairly new to running local LLMs, so please suggest ways I can get better performance out of this box.