My llama.cpp build is compiled with CUDA support, OpenBLAS, and AVX-512. While experimenting, I want inference to run purely on the CPU for now.
Running llama-cli with -ngl 0 still seems to make use of the GPU: I see a spike in GPU utilization and VRAM usage (in nvtop) when the model loads.
How can one explain that?
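For reference, the invocation is essentially the following (model path and prompt are placeholders). I know I could hide the GPU from the process entirely with CUDA_VISIBLE_DEVICES, but I'd like to understand why -ngl 0 alone still touches it:

```sh
# -ngl 0 requests zero layers offloaded to the GPU.
# Model path and prompt are placeholders.
./llama-cli -m ./models/model.gguf -ngl 0 -p "Hello"

# Sanity check: hiding all CUDA devices via the standard CUDA env var
# should force a pure-CPU run even if the CUDA backend initializes.
CUDA_VISIBLE_DEVICES="" ./llama-cli -m ./models/model.gguf -ngl 0 -p "Hello"
```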