Vulkan backend much easier on the CPU and GPU memory than CUDA.

Reddit r/LocalLLaMA / 4/2/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • A user running llama.cpp locally observed that using the CUDA backend for Qwen3.5-9B-GGUF pegged one CPU core at 100% and consumed 11GB+ of GPU memory, though throughput stayed around ~30 tokens/second.
  • Switching the same setup to the Vulkan backend reduced CPU usage to ~30% on a single core and lowered GPU memory usage to about 7.2GB, with reported speed unchanged at ~30 tokens/second.
  • The post is a troubleshooting/curiosity question asking why Vulkan shows a lower GPU memory footprint and lower CPU load than CUDA in this specific local-inference scenario.
  • The main takeaway is an anecdotal performance/resource-usage comparison between Vulkan and CUDA backends in llama.cpp for the same model and hardware.
  • The practical implication is that developers/operators may see different CPU/GPU memory behavior depending on the selected backend, even when throughput appears similar.

On Linux, I compiled my own llama.cpp with CUDA support. When running Qwen3.5-9B-GGUF:Q4_K_M on my potato-like RTX A2000 12GB, top would always show one CPU core pegged at 100%, and nvidia-smi would show 11GB+ of GPU memory usage. Speed is ~30 tokens per second. My system fans would spin up whenever that single core got pegged, which was annoying to listen to.

I decided to compile llama.cpp again with the Vulkan backend to see if anything would be different. Well, it was a big difference with the exact same model. Now, top shows only one CPU core at about 30% usage, and nvidia-smi shows just 7.2GB of GPU memory usage. Speed is the same at ~30 tokens per second, and my system fan no longer spins up during inference.
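For anyone wanting to reproduce the comparison, here is a sketch of the two builds using llama.cpp's CMake backend flags. Assumptions: the `GGML_CUDA`/`GGML_VULKAN` flags and the `llama-cli` binary name match recent llama.cpp trees (older checkouts used `LLAMA_CUBLAS` and a `main` binary instead), and the `.gguf` filename below is illustrative.

```shell
# Build llama.cpp with the CUDA backend (requires the CUDA toolkit).
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

# Build a second copy with the Vulkan backend (requires the Vulkan SDK).
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j

# Run the same model against each build while watching top and nvidia-smi.
# -ngl 99 offloads all layers to the GPU; the model path is a placeholder.
./build-cuda/bin/llama-cli   -m qwen3.5-9b-q4_k_m.gguf -ngl 99 -p "hello"
./build-vulkan/bin/llama-cli -m qwen3.5-9b-q4_k_m.gguf -ngl 99 -p "hello"
```

Keeping the two builds in separate directories makes it easy to A/B test resource usage without recompiling.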

Just curious why the GPU memory footprint and CPU usage are both lower with Vulkan vs. CUDA.

submitted by /u/Im_Still_Here12