Seeking resources to read about llama.cpp server and how offloading works

Reddit r/LocalLLaMA / 5/22/2026


Key Points

  • A user testing llama.cpp server with Qwen3 Coder Next describes offloading behavior on a machine with 32GB VRAM, comparing tokens-per-second performance across quantization levels.
  • They report that offloading many layers still degrades speed and they only partially observe expected RAM/swap usage, suggesting hidden or misunderstood resource management.
  • The user is puzzled by how llama.cpp appears to handle large memory components—especially with a large KV cache context (e.g., 120k) while monitoring shows low RAM/swap.
  • They ask where to find documentation/resources explaining llama.cpp’s “magic” and how offloading works, and whether desktop RAM monitoring tools (KDE5 widget) are missing activity.
  • The post frames the question as seeking an explanation of system resource handling on Kubuntu 24.04 rather than a new release or tool update.

First of all, I'm greatly impressed by how llama-cpp server handles offloading. There's some fucking magic happening here, at least to me.

I have 32gb of VRAM so loading in the small models is no problem, but now I'm starting to experiment with models that spill into system RAM, testing tok/sec differences and various quants.

I'm currently testing Qwen3 Coder Next. At Q4_K_M, this thing weighs in at 45gb. I can make that one work, but the more of it spills into system RAM, the slower it gets (obviously). So for now I'm testing the smaller 4-bit quant, IQ4_XS at 36gb, trying to find the middle ground before quality starts to suffer.

If I offload 36 layers to the GPU, it fills 30 of my 32gb of VRAM. Tok/sec is around 25, which for an MoE model is not great at all, at least I don't think it is. I tried the 3-bit quant, which fits fully in memory, but after multiple quality issues I gave up on it. I think for large models and coding, 3-bit is just too much compression, or at least it feels like it. (Anyone else have this impression, or is it just me?)

Anyways, to my actual question: how the hell does llama-cpp do this magic? I'm monitoring RAM usage and the swap file and neither is very high, yet I only have 30gb of this model loaded in VRAM, and that includes the 120k unquantized KV cache context... It's basically impossible, so clearly I am missing something about how Kubuntu 24.04 manages system resources.
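For a sense of scale on the KV cache part, here is a minimal sketch of the usual back-of-the-envelope formula for a grouped-query-attention transformer: two tensors (K and V) per attention layer, each of size context × KV heads × head dimension. The architecture numbers in the example call are hypothetical placeholders, not the real Qwen3 Coder Next values; the actual ones would have to come from the model's GGUF metadata or the llama.cpp load log.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size: K and V tensors for every attention layer.

    bytes_per_elem=2 corresponds to an unquantized f16 cache.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical architecture values, chosen only to illustrate the arithmetic.
    size = kv_cache_bytes(n_attn_layers=12, n_kv_heads=2, head_dim=256, n_ctx=120_000)
    print(f"estimated KV cache: {to_gib(size):.2f} GiB")
```

The point of the sketch is that the cache grows with the number of attention layers and KV heads, so a model with few of either can hold a long context in far less memory than intuition suggests.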

Is my KDE5 widget for RAM not capturing what llama-cpp is up to? I'd like to read up on how it works or if someone can explain it to my dumb ass, I'd greatly appreciate it lol.
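One thing worth checking here: llama.cpp memory-maps the model file by default (the `--no-mmap` flag turns this off), and memory-mapped file pages are counted by Linux as page cache rather than as memory "used" by the process. Whether the KDE widget hides page cache is an assumption, but many monitors do. The sketch below reads `/proc/meminfo` and separates the two numbers; the helper names are made up for illustration, and the "used" formula is only a rough approximation of what tools like `free` report.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Return (used_gib, cache_gib); 'used' excludes page cache and buffers."""
    kib_per_gib = 1024 * 1024
    cache = info.get("Cached", 0) + info.get("Buffers", 0) + info.get("SReclaimable", 0)
    used = info["MemTotal"] - info["MemFree"] - cache
    return used / kib_per_gib, cache / kib_per_gib


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        used, cache = summarize(parse_meminfo(f.read()))
    print(f"used (excluding cache): {used:.1f} GiB, page cache: {cache:.1f} GiB")
```

If the page cache figure jumps by roughly the size of the CPU-side weights while the server is running, that would suggest the "missing" memory is sitting in cache. Running `free -h` and looking at the buff/cache column gives the same information without any code.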

submitted by /u/Jorlen