B70: Quick and Early Benchmarks & Backend Comparison

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The post reports early benchmarking and backend comparison for llama.cpp (commit f1f793ad0) running a 27B Qwen3.5 GGUF model across SYCL, Vulkan, and OpenVINO backends.
  • SYCL results show strong prompt-processing throughput (~798 t/s at pp512 and ~709 t/s at pp16384), while the token-generation tests (tg128/tg512) run much slower at ~15 t/s.
  • Vulkan on the Intel(R) Graphics BMG G31 (Mesa-based) produces lower throughput than SYCL for the same model: pp512 around ~504 t/s, pp16384 around ~449 t/s, and token generation (tg128/tg512) around ~14 t/s.
  • OpenVINO is described as not yet reliably working end-to-end on the GPU path, with an execution error indicating a pre-allocated tensor cannot perform a CPY operation.
  • The author frames this as “barely on the brink of working,” noting dependencies like oneAPI runtime stability, an updated kernel/xe firmware, and Mesa built from source, suggesting the environment is still in flux.

llama.cpp: f1f793ad0 (8657)

This is a quick attempt to just get it up and running. Lots of the oneAPI runtime is still using "stable" packages from Intel's repo. Kernel 6.19.8+deb13-amd64 with an updated xe firmware built. Vulkan is Debian, but using the latest Mesa compiled from source. OpenVINO is 2026.0. Feels like everything is "barely on the brink of working" (which is to be expected).

SYCL:

$ build/bin/llama-bench -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL -p 512,16384 -n 128,512

| model                    | size      | params  | backend | ngl | test    | t/s           |
| ------------------------ | --------: | ------: | ------- | --: | ------: | ------------: |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | SYCL    |  99 | pp512   | 798.07 ± 2.72 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | SYCL    |  99 | pp16384 | 708.99 ± 1.90 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | SYCL    |  99 | tg128   | 15.64 ± 0.01  |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | SYCL    |  99 | tg512   | 15.61 ± 0.00  |

Vulkan:

$ bin/llama-bench -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL -p 512,16384 -n 128,512
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (BMG G31) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

| model                    | size      | params  | backend | ngl | test    | t/s           |
| ------------------------ | --------: | ------: | ------- | --: | ------: | ------------: |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan  |  99 | pp512   | 504.19 ± 0.26 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan  |  99 | pp16384 | 448.74 ± 0.04 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan  |  99 | tg128   | 14.10 ± 0.01  |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan  |  99 | tg512   | 14.08 ± 0.00  |
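
To put the two runs side by side, a quick back-of-the-envelope comparison of the reported means (ratios only, ignoring the ± error bars) looks like this:

```python
# Reported llama-bench means (t/s) from the SYCL and Vulkan runs above.
sycl   = {"pp512": 798.07, "pp16384": 708.99, "tg128": 15.64, "tg512": 15.61}
vulkan = {"pp512": 504.19, "pp16384": 448.74, "tg128": 14.10, "tg512": 14.08}

for test in sycl:
    ratio = sycl[test] / vulkan[test]
    print(f"{test:>8}: SYCL is {ratio:.2f}x Vulkan ({(ratio - 1) * 100:.0f}% faster)")
```

In this run SYCL leads by roughly 58% on prompt processing but only ~11% on token generation, which would be consistent with generation being bound by memory bandwidth rather than compute, though that interpretation is a guess, not something the post measures.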

OpenVINO:

$ GGML_OPENVINO_DEVICE=GPU GGML_OPENVINO_STATEFUL_EXECUTION=1 build_ov/bin/llama-bench -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL -p
OpenVINO: using device GPU

| model                    | size      | params  | backend | ngl | test    | t/s           |
| ------------------------ | --------: | ------: | ------- | --: | ------: | ------------: |

/home/aaron/src/llama.cpp/ggml/src/ggml-backend.cpp:809: pre-allocated tensor (cache_r_l0 (view) (copy of )) in a buffer (OPENVINO0) that cannot run the operation (CPY)
/home/aaron/src/llama.cpp/build_ov/bin/libggml-base.so.0(+0x15a25) [0x7f6183d72a25]
/home/aaron/src/llama.cpp/build_ov/bin/libggml-base.so.0(ggml_print_backtrace+0x1df) [0x7f6183d72def]
/home/aaron/src/llama.cpp/build_ov/bin/libggml-base.so.0(ggml_abort+0x11e) [0x7f6183d72f7e]
/home/aaron/src/llama.cpp/build_ov/bin/libggml-base.so.0(+0x2cf9c) [0x7f6183d89f9c]
/home/aaron/src/llama.cpp/build_ov/bin/libggml-base.so.0(ggml_backend_sched_split_graph+0xd3f) [0x7f6183d8bfbf]
/home/aaron/src/llama.cpp/build_ov/bin/libllama.so.0(_ZN13llama_context13graph_reserveEjjjPK22llama_memory_context_ibPm+0x5f6) [0x7f6183ebd466]
/home/aaron/src/llama.cpp/build_ov/bin/libllama.so.0(_ZN13llama_context13sched_reserveEv+0xf75) [0x7f6183ebf3f5]
/home/aaron/src/llama.cpp/build_ov/bin/libllama.so.0(_ZN13llama_contextC2ERK11llama_model20llama_context_params+0xab9) [0x7f6183ec07d9]
/home/aaron/src/llama.cpp/build_ov/bin/libllama.so.0(llama_init_from_model+0x11f) [0x7f6183ec155f]
build_ov/bin/llama-bench(+0x309bf) [0x55fc464089bf]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f6183035ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f6183035d65]
build_ov/bin/llama-bench(+0x32e71) [0x55fc4640ae71]
Aborted

(I swear I had this running before getting Vulkan going)

submitted by /u/abotsis