Llama.cpp's auto fit works much better than I expected

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author previously believed that with 32GB VRAM, they could only run ~20GB-class quantized models without suffering major slowdowns.
  • They report that llama.cpp’s `--fit` option allowed them to run Qwen3.6 Q8 with a 256k context, even when the model weights alone exceed their VRAM.
  • With a GeForce RTX 5090 attached over OCuLink, they report roughly 57 t/s, contrary to their earlier expectations.
  • The post suggests that `--fit` can make it practical to run larger models than expected, reducing the “VRAM or nothing” assumption for local inference users.

I always thought that with 32GB of VRAM, the biggest models I could run were around 20GB, like Qwen3.5 27B at Q4 or Q6. I was under the impression that everything had to fit in VRAM or I'd get 2 t/s.
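The "around 20GB" intuition above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes the 27B parameter count from the model name and treats Q4/Q6/Q8 as roughly 4, 6, and 8 bits per weight (real GGUF quants carry some overhead, so actual files run slightly larger):

```python
def quant_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB:
    parameters * bits per weight / 8 bits per byte / 1e9."""
    return params * bits_per_weight / 8 / 1e9

# A 27B model at roughly 4, 6, and 8 bits per weight.
for bits in (4, 6, 8):
    print(f"27B @ ~{bits} bpw ≈ {quant_size_gb(27e9, bits):.1f} GB")
# Q4 ≈ 13.5 GB and Q6 ≈ 20.25 GB leave headroom for KV cache in
# 32GB VRAM; a larger model at Q8 does not, hence the appeal of
# letting llama.cpp spill the remainder to system RAM.
```

The same formula explains why an 8-bit quant of a bigger model exceeds 32GB of VRAM on weights alone, matching the situation described in the post.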

Man, was I wrong. I just tested Qwen3.6 Q8 with a 256k context on llama.cpp with `--fit` enabled. The weights alone are bigger than my VRAM, and my 5090 is hooked up via OCuLink, but I'm still getting 57 t/s! This is literally magic. If you've been stuck in the same boat as me, thinking it's all VRAM or nothing, you should try this now!
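As a sketch only, the kind of run described above might look like the following. `--fit` is the option named in the post; the binary name, model filename, and the other flags are assumptions based on common llama.cpp usage, not details from the post:

```shell
# Hypothetical llama.cpp invocation; only --fit comes from the post.
# -c 262144 requests the 256k context window mentioned above.
# --fit lets llama.cpp decide how to place weights and KV cache
# across VRAM and system RAM instead of requiring manual -ngl tuning.
./llama-cli \
  -m ./Qwen3.6-Q8_0.gguf \
  -c 262144 \
  --fit \
  -p "Hello"
```

The point of the post is that this kind of automatic placement keeps throughput usable even when the weights alone exceed VRAM.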

submitted by /u/a9udn9u