Vulkan backend much easier on the CPU and GPU memory than CUDA.

Reddit r/LocalLLaMA / 4/2/2026



Key Points

A user running llama.cpp locally observed that the CUDA backend, running Qwen3.5-9B-GGUF, pegged one CPU core at 100% and used more than 11 GB of GPU memory, with throughput of about 30 tokens per second.
Switching the same setup to the Vulkan backend cut CPU usage to about 30% of a single core and GPU memory usage to about 7.2 GB, with throughput unchanged at about 30 tokens per second.
The post asks why Vulkan shows a lower GPU memory footprint and CPU load than CUDA in this specific local-inference scenario.
The main takeaway is an anecdotal resource-usage comparison between the Vulkan and CUDA backends in llama.cpp on the same model and hardware.
The practical implication is that operators may see different CPU and GPU memory behavior depending on the selected backend, even when throughput is similar.

I'm on Linux and compiled my own llama.cpp with CUDA support. top would always show one CPU core pegged at 100% when running Qwen3.5-9B-GGUF:Q4_K_M on my potato-like RTX A2000 12GB. Also, nvidia-smi would show 11GB+ of GPU memory usage. Speed is ~30 tokens per second. My system fans would spin up whenever that single core got pegged, which was annoying to listen to.

I decided to compile llama.cpp again with the Vulkan backend to see if anything would be different. Well, it made a big difference with the exact same model. Now top only shows one CPU core at about 30% usage, and nvidia-smi only shows 7.2GB of GPU memory usage. Speed is the same at ~30 tokens per second. My system fans no longer spin up when running inference.

Just curious why the GPU memory footprint is lower and CPU usage is lower when using Vulkan vs CUDA.
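
For anyone who wants to repeat the comparison with numbers instead of eyeballing nvidia-smi, here is a rough Python sketch. It assumes nvidia-smi is on the PATH. The function names (parse_memory_used_mib, query_memory_used_mib, relative_reduction) are made up for illustration and are not part of llama.cpp. Note that memory.used is the whole-GPU figure in MiB, so anything else running on the card is counted too.

```python
import subprocess


def parse_memory_used_mib(csv_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def query_memory_used_mib() -> list[int]:
    """Ask nvidia-smi for the memory currently in use on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used_mib(result.stdout)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a measurement dropped going from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    # Figures reported in the post; 11 GB is a lower bound ("11GB+").
    print(f"GPU memory: {relative_reduction(11.0, 7.2):.1%} lower with Vulkan")
    print(f"CPU (one core): {relative_reduction(100.0, 30.0):.1%} lower with Vulkan")
```

Run query_memory_used_mib() once while the CUDA build is generating and once while the Vulkan build is generating, then feed the two readings to relative_reduction. With the figures from the post, the drop is roughly a third of GPU memory and 70% of the single-core CPU load.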

submitted by /u/Im_Still_Here12