[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs - reorder optimization fix (PR submitted)

Reddit r/LocalLLaMA / 4/7/2026


Key Points

  • llama.cpp’s SYCL backend previously failed to apply its “reorder” memory-layout optimization to Q8_0 quantization on Intel Arc (Xe2/Battlemage) GPUs, causing poor cache/memory efficiency.
  • The issue led to Q8_0 reaching only ~21% of theoretical memory bandwidth (4.88 t/s) versus much higher throughput for lower-bit Q4_K_M (20.56 t/s).
  • A submitted fix extends the existing reorder framework (~200 lines of code) to Q8_0 and corrects a critical buffer-init bug where the per-tensor "extra" struct wasn't allocated, so the reorder flag was never set.
  • After the change on Intel Arc Pro B70, Q8_0 throughput increases to ~15.24 t/s (~66% bandwidth), yielding a reported 3.1x speedup in token generation and making Q8_0 faster than Q6_K in the author’s tests.
  • Validation against Intel’s closed-source IPEX-LLM (via binary patching for hardware compatibility) suggests the performance target is achievable, supporting that the root cause was in the reorder/dispatch path rather than drivers or VRAM constraints.

TL;DR: Q8_0 quantization on Intel Xe2 (Battlemage/Arc B-series) GPUs was achieving only 21% of theoretical memory bandwidth. My AI Agent and I found the root cause and submitted a fix that brings it to 66% - a 3.1x speedup in token generation.

The problem:

On Intel Arc Pro B70, Q8_0 models ran at 4.88 t/s while Q4_K_M ran at 20.56 t/s, a 4x gap that shouldn't exist since Q8_0 only has ~1.7x more data. After ruling out VRAM pressure, drivers, and backend issues, we traced it to the SYCL kernel dispatch path.
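Those percentages can be sanity-checked with back-of-envelope math. Token generation streams roughly the full weight set once per token, so effective bandwidth is tokens/s × weight bytes. The ~26 GB weight size below is my assumption, back-derived from the post's numbers, not a measured figure:

```cpp
#include <cmath>

// Q8_0 stores 32 weights per 34-byte block: 2-byte fp16 scale + 32 int8 weights.
inline double q8_0_bits_per_weight() { return 34.0 * 8.0 / 32.0; }  // 8.5 bits/weight

// Effective bandwidth utilization: tokens/s times bytes streamed per token,
// divided by the card's peak bandwidth.
inline double bw_utilization(double tok_per_s, double weights_gb, double peak_gb_s) {
    return tok_per_s * weights_gb / peak_gb_s;
}
```

With ~26 GB of Q8_0 weights on a 608 GB/s card, 4.88 t/s works out to ~21% utilization and 15.24 t/s to ~65%, consistent with the reported figures; and 8.5 bits/weight versus roughly 4.8-5.0 for Q4_K_M gives the ~1.7x size ratio.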

Root cause:

llama.cpp's SYCL backend has a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This was implemented for Q4_0, Q4_K, and Q6_K - but Q8_0 was never added. Q8_0's 34-byte blocks (not power-of-2) make the non-reordered layout especially bad for GPU cache performance.
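A simplified sketch of the idea (this mirrors the reorder concept, not llama.cpp's actual SYCL code): in the on-disk layout, scales and weights are interleaved in 34-byte blocks, so neighboring GPU work-items load from offsets that straddle cache lines at shifting alignments. Reordering packs all weight bytes contiguously and appends the scales afterwards:

```cpp
#include <cstdint>
#include <cstring>

// Q8_0 block as stored: one fp16 scale + 32 int8 weights = 34 bytes.
// (uint16_t stands in for the raw fp16 scale bits.)
constexpr int QK8_0 = 32;
struct block_q8_0 {
    uint16_t d;           // scale (fp16 bits)
    int8_t   qs[QK8_0];   // quantized weights
};
static_assert(sizeof(block_q8_0) == 34, "34 bytes: not a power of 2");

// "Reordered" layout: nblocks * 32 bytes of weights, then nblocks scales.
// Consecutive work-items now read dense, aligned runs of weight bytes.
void reorder_q8_0(const block_q8_0 *src, size_t nblocks, uint8_t *dst) {
    uint8_t *qs_out = dst;                    // weight region
    uint8_t *d_out  = dst + nblocks * QK8_0;  // scale region
    for (size_t i = 0; i < nblocks; ++i) {
        memcpy(qs_out + i * QK8_0, src[i].qs, QK8_0);
        memcpy(d_out + i * sizeof(uint16_t), &src[i].d, sizeof(uint16_t));
    }
}
```

The total byte count is unchanged; only the layout differs, which is why the fix costs nothing in VRAM.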

Sooo, the fix:

~200 lines of code extending the existing reorder framework to Q8_0. The most critical bug was actually a single line - Q8_0 tensors weren't getting the "extra" struct allocated during buffer init, so the reorder flag was silently never set.
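The flag bug can be illustrated with a toy model (identifiers here are illustrative, not llama.cpp's actual ones): the reorder path only runs if buffer init allocates a per-tensor "extra" struct, and Q8_0 was missing from that check, so the flag could never be set.

```cpp
#include <cstddef>

// Toy model of the dispatch bug; names are illustrative.
enum ggml_type { TYPE_Q4_0, TYPE_Q4_K, TYPE_Q6_K, TYPE_Q8_0 };

struct tensor_extra { bool reorder = false; };

struct tensor {
    ggml_type     type;
    tensor_extra *extra = nullptr;
};

static tensor_extra g_extras[16];
static int g_n_extras = 0;

void buffer_init(tensor &t) {
    // Before the fix (per the post), Q8_0 was absent from this condition,
    // so t.extra stayed null for Q8_0 tensors.
    if (t.type == TYPE_Q4_0 || t.type == TYPE_Q4_K ||
        t.type == TYPE_Q6_K || t.type == TYPE_Q8_0 /* the one-line fix */) {
        t.extra = &g_extras[g_n_extras++];
    }
}

bool use_reorder_kernel(const tensor &t) {
    // Silently false whenever extra was never allocated - no error, just
    // a fallback to the slow non-reordered path.
    return t.extra && t.extra->reorder;
}
```

That failure mode (a null pointer short-circuiting a feature check) is why nothing showed up in logs: the slow path is also a correct path.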

Results on Qwen3.5-27B (Intel Arc Pro B70):

  • Q8_0 before: 4.88 t/s (21% bandwidth)
  • **Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster**
  • Q4_K_M: 20.12 t/s (unchanged)
  • Q6_K: 13.83 t/s (no reorder)

Q8_0 is now faster than Q6_K (15.24 vs 13.83 t/s) in my testing, while providing higher quality.

Validation: Before writing the fix, we binary-patched Intel's closed-source IPEX-LLM to run on my GPU (it doesn't support B70's PCI device ID). Their optimized Q8_0 kernels achieved 61% bandwidth, confirming the problem was solvable. My open-source implementation achieves 66%.

PR: https://github.com/ggml-org/llama.cpp/pull/21527

Issue: https://github.com/ggml-org/llama.cpp/issues/21517

Hardware: Intel Arc Pro B70, 32 GB GDDR6, 608 GB/s bandwidth

submitted by /u/Katostrofik