First-time local LLM user here!
I’m running an old HP Z640 workstation with dual Xeon E5 v4 CPUs and around 100GB of RAM. It used to have a Titan X Pascal, but I swapped that out for an Arc B70. I’m not sure whether the motherboard supports Resizable BAR (ReBAR), but I believe it does support Above 4G Decoding. After quite a bit of fiddling with BIOS settings, I finally got the machine to boot with the B70 installed. The key was keeping the card plugged into a powered-on monitor until the GRUB screen appeared; otherwise the system wouldn’t boot and would just beep six to eight times.
For running LLMs, I’ve had good success with the Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf model under llama.cpp, which performs decently with a ~130k context window. I couldn’t get vLLM or any other runtime to work, though. Both the Vulkan and SYCL backends work with llama.cpp, but SYCL is faster for me. I’m running Ubuntu 26.04 (beta) and followed the steps in PR #22078 to get the SYCL backend compiled and running.
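For anyone else attempting the SYCL route, the build boils down to roughly the following. This is a sketch, not the exact steps from the PR: it assumes the Intel oneAPI Base Toolkit is installed in its default location, so check PR #22078 and llama.cpp's SYCL docs for the authoritative flags.

```shell
# Assumes the Intel oneAPI Base Toolkit is installed at the default path;
# adjust for your setup. Sketch only -- see PR #22078 for the exact steps.
source /opt/intel/oneapi/setvars.sh   # puts icx/icpx and the SYCL runtime on PATH

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

sycl-ls   # sanity check: the Arc card should be listed as a Level Zero GPU device
```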
Here are the configs that worked for me (though I’m still tweaking them):
```
./llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --alias "qwen-3.6-35b" \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 1024 \
  --flash-attn 1 \
  --cache-ram 8192 \
  -np 1 --host 0.0.0.0 --port 8100 \
  -ngl all \
  --ctx-size 131072 --temp 0.6 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-checkpoints 32 --swa-full --jinja
```

Here’s some performance data:
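One reason the q8_0 K/V cache settings matter at this context size: a rough back-of-envelope for KV memory. The layer/head numbers below are assumptions typical of a ~30B A3B-class config with GQA, not values read from this exact GGUF, so treat the result as illustrative only.

```python
# Rough KV-cache size estimate. The model dimensions here are ASSUMPTIONS
# (typical of a ~30B MoE with grouped-query attention), not confirmed for this model.
n_layers, n_kv_heads, head_dim = 48, 4, 128
ctx = 131072

elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # x2 for K and V

f16_gib = elems * 2 / 2**30           # default f16 cache: 2 bytes per element
q8_0_gib = elems * (34 / 32) / 2**30  # q8_0: 34 bytes per 32-element block

print(f"f16: {f16_gib:.2f} GiB, q8_0: {q8_0_gib:.2f} GiB")
# With these assumed dims at 128k context: f16 = 12.00 GiB vs q8_0 ~ 6.4 GiB
```

So quantizing the cache roughly halves KV memory at full context, which is what makes the 131072 window workable here.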
- Prompt eval time: 278,576.23 ms / 78,720 tokens (3.54 ms per token, 282.58 tokens per second)
- Eval time: 15,292.59 ms / 181 tokens (84.49 ms per token, 11.84 tokens per second)
- Total time: 293,868.82 ms / 78,901 tokens
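The tokens-per-second figures are just token counts divided by the reported times; a quick arithmetic check against the numbers above:

```python
# Sanity-check the throughput numbers from llama.cpp's timing output.
prompt_ms, prompt_tokens = 278576.23, 78720
eval_ms, eval_tokens = 15292.59, 181

prompt_tps = prompt_tokens / (prompt_ms / 1000)
eval_tps = eval_tokens / (eval_ms / 1000)

print(f"prompt: {prompt_tps:.2f} tok/s, gen: {eval_tps:.2f} tok/s")
# -> prompt: 282.58 tok/s, gen: 11.84 tok/s (matches the log)
```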
Hope this helps anyone else with a similar setup! I’m fairly new to running local LLMs, so please suggest ways I can get better performance out of this box.