Seeking resources to read about llama.cpp server and how offloading works

Reddit r/LocalLLaMA / 5/22/2026



Key Points

A user testing llama.cpp server with Qwen3 Coder Next describes offloading behavior on a machine with 32 GB of VRAM, comparing tokens-per-second performance across quantization levels.
They report that offloading many layers still degrades speed and they only partially observe expected RAM/swap usage, suggesting hidden or misunderstood resource management.
The user is puzzled by how llama.cpp appears to handle large memory components—especially with a large KV cache context (e.g., 120k) while monitoring shows low RAM/swap.
They ask where to find documentation/resources explaining llama.cpp’s “magic” and how offloading works, and whether desktop RAM monitoring tools (KDE5 widget) are missing activity.
The post frames the question as seeking an explanation of system resource handling on Kubuntu 24.04 rather than a new release or tool update.

First of all, I'm greatly impressed by how llama-cpp server handles offloading. There's some fucking magic happening here, at least to me.

I have 32gb of VRAM so loading in the small models is no problem, but now I'm starting to experiment with models that spill into system RAM, testing tok/sec differences and various quants.

I'm currently testing Qwen3 Coder Next. At Q4_K_M, this thing weighs in at 45 GB. I can make that one work, but the more of it that spills into system RAM, the slower it gets (obviously). So I'm now testing the smaller 4-bit quant, IQ4_XS at 36 GB, trying to find the middle ground before quality starts to suffer.

If I offload 36 layers to the GPU, it fills my VRAM to 30/32 GB. Tok/sec is around 25, which for an MoE model is not great at all, at least I don't think it is. I tried the 3-bit quant, which fits fully in VRAM, but after multiple quality issues, I gave up on it. I think for large models and coding, 3-bit is just too much compression, or at least it feels like it. (Anyone else have this impression? Or is it just me?)

Anyways - to my actual question - how the hell does llama-cpp do this magic? I am monitoring RAM usage and the swap file, and neither of them is very high, yet only about 30 GB of this 36 GB model is sitting in VRAM, and that includes the 120k unquantized KV cache context... The rest has to live somewhere, so it looks basically impossible, and clearly I am missing something about how Kubuntu 24.04 manages system resources.
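To put rough numbers on that gap, here is a back-of-the-envelope sketch in Python. The function name is made up for illustration, the KV cache size is left as an input because the post does not give it, and compute buffers are ignored, so treat the result as a lower bound and not as a measurement.

```python
def min_weights_outside_vram(model_gb, vram_used_gb, kv_cache_gb=0.0):
    """Lower bound on how much of the model cannot be sitting in VRAM.

    model_gb:      size of the model file (36 for the IQ4_XS quant above)
    vram_used_gb:  VRAM in use (about 30 in the post)
    kv_cache_gb:   part of that VRAM taken by the KV cache (unknown here,
                   so it defaults to 0, which gives the most optimistic case)
    """
    weights_in_vram = vram_used_gb - kv_cache_gb
    return max(0.0, model_gb - weights_in_vram)


# Most optimistic case: pretend the KV cache takes no VRAM at all.
print(min_weights_outside_vram(36, 30))
```

Even in that optimistic case, about 6 GB of weights have to be held somewhere other than VRAM, and any VRAM taken by the 120k context pushes that figure higher.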

Is my KDE5 widget for RAM not capturing what llama-cpp is up to? I'd like to read up on how it works or if someone can explain it to my dumb ass, I'd greatly appreciate it lol.
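One thing that may be worth checking: llama.cpp memory-maps the model file by default (the `--no-mmap` flag turns this off), and on Linux file-backed mmap pages are accounted as page cache, not as a process's "used" memory. A widget that only shows "used" RAM could therefore miss them. The sketch below is a minimal way to look at the kernel's own numbers in `/proc/meminfo`; the helper names `parse_meminfo` and `summarize` are made up for this example, and whether the KDE widget actually hides cache is an assumption to verify.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> value (KiB for sized fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Return GiB figures that keep page cache separate from available memory."""
    kib_per_gib = 1024 * 1024
    swap_used = fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)
    return {
        "total_gib": fields.get("MemTotal", 0) / kib_per_gib,
        "available_gib": fields.get("MemAvailable", 0) / kib_per_gib,
        "page_cache_gib": fields.get("Cached", 0) / kib_per_gib,
        "swap_used_gib": swap_used / kib_per_gib,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for key, value in summarize(parse_meminfo(f.read())).items():
            print(f"{key}: {value:.1f}")
```

Running this once before starting the server and once after the model has loaded and generated some tokens should show whether `page_cache_gib` grows by roughly the amount of the model that is not in VRAM.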

submitted by /u/Jorlen