Fix: Dual Intel Arc GPUs using all system RAM during inference - found the cause and a working fix (llama.cpp SYCL)

Reddit r/LocalLLaMA / 4/8/2026


Key Points

  • The article explains that dual Intel Arc GPUs running llama.cpp with the SYCL backend can trigger extreme system RAM usage even when the model fits in VRAM.
  • The root cause is a specific SYCL call, `sycl::malloc_device()`, which causes Intel’s xe kernel driver to mirror GPU allocations into system RAM via DMA-buf/TTM staging at allocation time.
  • A targeted test shows a roughly 1:1 system RAM cost with `sycl::malloc_device()` (~+4,112 MiB for a 4 GiB VRAM allocation) versus negligible cost with Level Zero’s `zeMemAllocDevice()` (~+8 MiB for the same allocation).
  • The proposed fix is to use the Level Zero allocation path (`zeMemAllocDevice()`) instead of `sycl::malloc_device()`, leveraging that SYCL kernels can read the resulting pointers without compatibility issues.
  • Reported symptoms include system RAM reaching 100%, the OOM killer terminating desktop processes, and system instability or login-screen drops during multi-GPU model loading.

If you're running dual Intel Arc GPUs with llama.cpp and your system RAM maxes out during multi-GPU inference, even though the model fits in VRAM, this post explains why and how to fix it.

I've been running dual Arc Pro B70s (32GB each, 64GB total VRAM) for local LLM inference with llama.cpp's SYCL backend. Every time I tried to split a model across both GPUs, my 64GB of system RAM would climb to 100% and the OOM killer would start taking out desktop processes until the system either crashed or dumped me at the login screen. This happened with every model size. A 15 GiB Q4_K_M model was eating 46 GiB of system RAM. It made no sense.

Turns out it's not a configuration issue, not a VRAM issue, and not about model size. It's a specific API call in llama.cpp's SYCL backend that triggers the wrong memory path in Intel's xe kernel driver.

What's actually happening

Every call to sycl::malloc_device() in the SYCL backend causes the xe kernel driver to create a 1:1 mirror of the GPU allocation in system RAM through DMA-buf/TTM staging. This happens at allocation time, not during inference. Every tensor, every KV cache buffer, every compute scratch buffer that gets allocated on the GPU also consumes an equal amount of your system RAM.

I confirmed this with a targeted test:

Allocation Method     | 4 GiB on GPU | System RAM Impact
sycl::malloc_device() | 4 GiB VRAM   | +4,112 MiB system RAM
zeMemAllocDevice()    | 4 GiB VRAM   | +8 MiB system RAM

Same VRAM allocation, same GPU, same driver. 500x difference in system RAM usage depending on which API you call.
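
If you want to reproduce the measurement yourself, the sketch below is roughly what such a test looks like (it's not the exact program I used, and the numbers will vary with driver and kernel versions):

    // Minimal repro sketch: allocate 4 GiB with each API and report the drop
    // in MemAvailable. Build roughly: icpx -fsycl repro.cpp -lze_loader
    #include <sycl/sycl.hpp>
    #include <sycl/ext/oneapi/backend/level_zero.hpp>
    #include <level_zero/ze_api.h>
    #include <fstream>
    #include <iostream>
    #include <string>

    static long mem_available_kib() {
        std::ifstream f("/proc/meminfo");
        std::string key, unit;
        long val = 0;
        while (f >> key >> val >> unit) {
            if (key == "MemAvailable:") return val;
        }
        return -1;
    }

    int main() {
        const size_t size = 4ull << 30; // 4 GiB
        sycl::queue q{sycl::gpu_selector_v};

        // Path 1: sycl::malloc_device() -- DMA-buf/TTM (mirrored in system RAM)
        long before = mem_available_kib();
        void *p1 = sycl::malloc_device(size, q);
        std::cout << "sycl::malloc_device delta: "
                  << (before - mem_available_kib()) / 1024 << " MiB\n";
        sycl::free(p1, q);

        // Path 2: zeMemAllocDevice() -- SVM/P2P (stays on the GPU)
        auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
        auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());
        ze_device_mem_alloc_desc_t desc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC};
        void *p2 = nullptr;
        before = mem_available_kib();
        zeMemAllocDevice(ze_ctx, &desc, size, 64, ze_dev, &p2);
        std::cout << "zeMemAllocDevice delta:    "
                  << (before - mem_available_kib()) / 1024 << " MiB\n";
        zeMemFree(ze_ctx, p2);
        return 0;
    }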

The xe driver has two internal kernel paths for device memory:

  1. DMA-buf/TTM - mirrors VRAM in system RAM. This is what sycl::malloc_device() triggers.
  2. SVM/P2P - direct PCIe BAR access, virtually no system RAM. This is what Level Zero's zeMemAllocDevice() uses.

SYCL kernels can read zeMemAllocDevice pointers with zero issues. Full interop, no compatibility problems. The only difference is which kernel path gets triggered under the hood.
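
As a sketch of what that looks like in practice (assuming q is a Level Zero-backed SYCL queue and ptr came from zeMemAllocDevice() on the same context and device):

    // Using a zeMemAllocDevice() pointer directly from a SYCL kernel.
    float *data = static_cast<float *>(ptr);
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        data[i] *= 2.0f; // ordinary USM-style access, no extra interop calls
    }).wait();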

Symptoms you might recognize

  • System RAM climbs to 100% when loading a model across two GPUs, even though the model fits in VRAM
  • OOM killer starts taking out desktop processes (pipewire, nautilus, wireplumber)
  • System becomes unresponsive or drops you to the login screen
  • Adding swap "helps" but inference gets painfully slow
  • Someone told you that you need 128 GB RAM for dual GPUs
  • Single GPU works fine, dual GPU crashes

The fix

Replace sycl::malloc_device() with zeMemAllocDevice() throughout llama.cpp's SYCL backend. I wrote centralized helper functions with automatic fallback:

    static void * ggml_sycl_malloc_device(size_t size, sycl::queue &q) {
        void *ptr = nullptr;
        try {
            // Get the Level Zero handles backing this SYCL queue.
            auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
            auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());

            // Allocate device memory through Level Zero (SVM/P2P path, no RAM mirror).
            ze_device_mem_alloc_desc_t alloc_desc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC};
            ze_result_t r = zeMemAllocDevice(ze_ctx, &alloc_desc, size, 64, ze_dev, &ptr);
            if (r == ZE_RESULT_SUCCESS && ptr) {
                return ptr;
            }
        } catch (...) {
            // Level Zero interop unavailable; fall through to the SYCL allocator.
        }
        return sycl::malloc_device(size, q); // fallback
    }

The fix touches 4 files, replaces 3 allocation sites and 3 free sites, and links against ze_loader. If Level Zero interop isn't available for some reason, it falls back to the original sycl::malloc_device behavior automatically.
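
Because the allocation helper can fall back to sycl::malloc_device(), the free sites need to know which deallocator to call for a given pointer. One way to handle that (a sketch with hypothetical names, not necessarily how the actual patch does it) is to track the Level Zero pointers:

    // Sketch of a matching free helper: track which pointers came from the
    // Level Zero path so the right deallocator is called.
    #include <mutex>
    #include <unordered_set>

    static std::unordered_set<void *> g_ze_allocs; // filled by ggml_sycl_malloc_device
    static std::mutex                 g_ze_allocs_mtx;

    static void ggml_sycl_free_device(void *ptr, sycl::queue &q) {
        if (ptr == nullptr) return;
        bool is_ze = false;
        {
            std::lock_guard<std::mutex> lock(g_ze_allocs_mtx);
            is_ze = g_ze_allocs.erase(ptr) > 0;
        }
        if (is_ze) {
            auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
            zeMemFree(ze_ctx, ptr);   // allocated by zeMemAllocDevice()
        } else {
            sycl::free(ptr, q);       // allocated by the sycl::malloc_device fallback
        }
    }

The allocation helper would add the pointer to g_ze_allocs on the Level Zero success path; anything from the fallback path stays untracked and goes through sycl::free.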

Before and after

Q4_K_M (15.6 GiB model), 48K context, dual GPU:

Metric            | Before                       | After
Peak system RAM   | 60,034 MiB (100%), OOM crash | ~6.7 GiB (10%), flat
Prompt processing | crash                        | 782 t/s
pp512 speed       | 348 t/s                      | 359 t/s
tg128 speed       | 17.92 t/s                    | 17.92 t/s

Q8_0 (26.6 GiB model), 32K context, dual GPU:

Metric            | Before          | After
Peak system RAM   | 100%, OOM crash | flat, no issue
Prompt processing | crash           | 915 t/s

System RAM stays flat at around 10% throughout all dual-GPU tests. No OOM, no crashes, no performance regression. Output is byte-for-byte identical between single GPU and dual GPU (verified with seed=42).

Things we tried that didn't work

Before finding the real cause, we spent hours on these. None of them fix the problem:

  • Disabling IOMMU (iommu=off in GRUB) - no effect
  • Direct SYCL device-to-device memcpy (replacing the host bounce buffer) - faster transfers but same RAM usage
  • NEO debug keys (UseKmdMigration=0, etc.) - no effect
  • cgroup memory limits - the TTM allocations happen kernel-side, so they're not charged to process cgroups
  • Disabling ACS on PCIe root ports - no effect
  • Level Zero IPC handles (zeMemGetIpcHandle) - these also consume system RAM

The only fix is replacing the allocation function itself.

Why Nvidia and AMD don't have this problem

CUDA and ROCm have their own peer-to-peer memory management that doesn't go through the kernel's generic DMA-buf path. Intel's xe driver actually has a working P2P/SVM path in kernel 7.0+, but sycl::malloc_device() triggers the older DMA-buf export path instead of using it. Intel's own multi-GPU inference stack (llm-scaler, which uses vLLM) avoids this by using Level Zero APIs directly.

System details

  • 2x Intel Arc Pro B70 (32 GB each, Battlemage/Xe2)
  • AMD Ryzen 5 9600X, 64 GB DDR5-4800
  • Ubuntu 26.04, kernel 7.0.0-12-generic, xe driver, compute-runtime 26.09
  • llama.cpp SYCL backend (commit 69c28f1)
  • Display on AMD Radeon iGPU, both B70s are compute-only
  • Model: Qwen3.5-27B (tested Q4_K_M, Q5_K_M, Q6_K, Q8_0)

What's next

I'm planning to submit this as a PR to llama.cpp. If you're hitting this issue and want to fix it locally, I'm happy to share the full patch and test programs.

This probably affects anyone using Intel multi-GPU with any SYCL-based inference engine, not just llama.cpp. The root cause is in how SYCL's allocation function interacts with the xe driver, not in llama.cpp specifically.

I also posted the initial findings on X before we found the fix, if you want to see the real-time investigation.

submitted by /u/Katostrofik