llama.cpp -ngl 0 still shows some GPU usage?

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post describes a user who compiled llama.cpp with CUDA, OpenBLAS, and AVX512 support but is attempting to run inference purely on the CPU using the -ngl 0 option.
  • Despite -ngl 0, the user observes GPU activity, with spikes in GPU utilization and VRAM usage during model loading in llama-cli, as seen via nvtop.
  • The question focuses on why GPU resources are still utilized even when GPU offloading is ostensibly disabled.
  • The situation suggests potential CUDA-related initialization, partial GPU involvement during startup/loading, or behavior tied to how llama.cpp handles compiled backends.
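The backend-initialization explanation in the last point can be tested directly. A CUDA-enabled llama.cpp build typically initializes the CUDA backend at startup, which allocates a small CUDA context in VRAM even when -ngl 0 keeps every layer on the CPU. A sketch of ways to force a truly CPU-only run, assuming a recent llama.cpp build (the `--device` flag is only present in newer versions; `CUDA_VISIBLE_DEVICES` is a standard CUDA runtime environment variable):

```shell
# Hide all GPUs from the CUDA runtime so the backend finds no device to initialize
CUDA_VISIBLE_DEVICES="" ./llama-cli -m model.gguf -ngl 0 -p "Hello"

# Recent llama.cpp builds also accept an explicit device selection
./llama-cli -m model.gguf --device none -p "Hello"

# Or rebuild without the CUDA backend entirely
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release
```

If nvtop shows no activity under the first variant, the earlier spike was backend/context initialization rather than actual layer offload.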

My llama.cpp is compiled with CUDA support, OpenBLAS, and AVX512. As I'm experimenting, I'm trying to have inference happen purely on the CPU for now.

-ngl 0 seems to still make use of the GPU: I see a spike in GPU processor and RAM usage (via nvtop) when loading the model with llama-cli.

How can one explain that?

submitted by /u/sob727