TL;DR: Q8_0 quantization on Intel Xe2 (Battlemage/Arc B-series) GPUs was achieving only 21% of theoretical memory bandwidth. My AI agent and I found the root cause and submitted a fix that brings it to 66% - a 3.1x speedup in token generation.
The problem:
On an Intel Arc Pro B70, Q8_0 models ran at 4.88 t/s while Q4_K_M ran at 20.56 t/s - a 4x gap that shouldn't exist, since Q8_0 holds only ~1.7x more data. After ruling out VRAM pressure, drivers, and backend issues, we traced it to the SYCL kernel dispatch path.
Root cause:
llama.cpp's SYCL backend has a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This was implemented for Q4_0, Q4_K, and Q6_K - but Q8_0 was never added. Q8_0's 34-byte blocks (not power-of-2) make the non-reordered layout especially bad for GPU cache performance.
The fix:
About 200 lines of code extending the existing reorder framework to Q8_0. The most critical bug was a single line: Q8_0 tensors weren't getting the "extra" struct allocated during buffer init, so the reorder flag was silently never set.
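The failure mode is worth spelling out, because it degrades with no error. A hedged sketch (names and structure are illustrative, not the actual llama.cpp SYCL source): buffer init only allocates the per-tensor "extra" for types on an allow-list, and dispatch later checks a flag inside that struct - so a type missing from the list silently takes the slow path.

```cpp
// Illustrative types only -- not the real ggml/SYCL definitions.
enum ggml_type_sketch { TYPE_Q4_0, TYPE_Q4_K, TYPE_Q6_K, TYPE_Q8_0 };

struct tensor_extra { bool optimized_layout = false; };

struct tensor_sketch {
    ggml_type_sketch type;
    tensor_extra *extra = nullptr;  // only allocated during buffer init
};

// Before the fix, Q8_0 was absent from this check; adding it is the
// "single line" referred to above.
bool type_is_reorderable(ggml_type_sketch t) {
    return t == TYPE_Q4_0 || t == TYPE_Q4_K || t == TYPE_Q6_K
        || t == TYPE_Q8_0;
}

void buffer_init(tensor_sketch &t) {
    if (type_is_reorderable(t.type)) {
        t.extra = new tensor_extra{};
        t.extra->optimized_layout = true;  // reorder happens here
    }
}

bool uses_fast_path(const tensor_sketch &t) {
    // With extra == nullptr this returns false and the slow path
    // runs -- no error, no warning, just lost bandwidth.
    return t.extra && t.extra->optimized_layout;
}
```

This is why the bug was invisible for so long: every code path still produced correct results, just at a fraction of the achievable bandwidth.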
Results on Qwen3.5-27B (Intel Arc Pro B70):
- Q8_0 before: 4.88 t/s (21% bandwidth)
- **Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster**
- Q4_K_M: 20.12 t/s (unchanged)
- Q6_K: 13.83 t/s (no reorder)
Q8_0 is now faster than Q6_K (15.24 vs 13.83 t/s) in my testing, while providing higher quality.
Validation: Before writing the fix, we binary-patched Intel's closed-source IPEX-LLM to run on my GPU (it doesn't support B70's PCI device ID). Their optimized Q8_0 kernels achieved 61% bandwidth, confirming the problem was solvable. My open-source implementation achieves 66%.
PR: https://github.com/ggml-org/llama.cpp/pull/21527
Issue: https://github.com/ggml-org/llama.cpp/issues/21517
Hardware: Intel Arc Pro B70, 32 GB GDDR6, 608 GB/s bandwidth