Llama.cpp's auto fit works much better than I expected

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author previously believed that with 32GB VRAM, they could only run ~20GB-class quantized models without suffering major slowdowns.
  • They report that llama.cpp’s `--fit` option allowed them to run Qwen3.6 Q8 with a 256k context, even when the model weights alone exceed their VRAM.
  • With a GeForce RTX 5090 attached over OCuLink, they report roughly 57 t/s, contrary to their earlier expectations.
  • The post suggests that `--fit` can make it practical to run larger models than expected, reducing the “VRAM or nothing” assumption for local inference users.

I always thought that with 32GB of VRAM, the biggest models I could run were around 20GB, like Qwen3.5 27B at Q4 or Q6. I was under the impression that everything had to fit in VRAM or I'd get 2 t/s.
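The "around 20GB" intuition above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes the 27B parameter count from the model name and treats Q4/Q6/Q8 as roughly 4, 6, and 8 bits per weight (real GGUF quants carry some overhead, so actual files run slightly larger):

```python
def quant_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB:
    parameters * bits per weight / 8 bits per byte / 1e9."""
    return params * bits_per_weight / 8 / 1e9

# A 27B model at roughly 4, 6, and 8 bits per weight.
for bits in (4, 6, 8):
    print(f"27B @ ~{bits} bpw ≈ {quant_size_gb(27e9, bits):.1f} GB")
# Q4 ≈ 13.5 GB and Q6 ≈ 20.25 GB leave headroom for KV cache in
# 32GB VRAM; a larger model at Q8 does not, hence the appeal of
# letting llama.cpp spill the remainder to system RAM.
```

The same formula explains why an 8-bit quant of a bigger model exceeds 32GB of VRAM on weights alone, matching the situation described in the post.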

Man, was I wrong. I just tested Qwen3.6 Q8 with a 256k context on llama.cpp with `--fit` enabled. The weights alone are bigger than my VRAM, and my 5090 is hooked up via OCuLink, but I'm still getting 57 t/s! This is literally magic. If you've been stuck in the same boat as me, thinking it's all VRAM or nothing, you should try this now!
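As a sketch only, the kind of run described above might look like the following. `--fit` is the option named in the post; the binary name, model filename, and the other flags are assumptions based on common llama.cpp usage, not details from the post:

```shell
# Hypothetical llama.cpp invocation; only --fit comes from the post.
# -c 262144 requests the 256k context window mentioned above.
# --fit lets llama.cpp decide how to place weights and KV cache
# across VRAM and system RAM instead of requiring manual -ngl tuning.
./llama-cli \
  -m ./Qwen3.6-Q8_0.gguf \
  -c 262144 \
  --fit \
  -p "Hello"
```

The point of the post is that this kind of automatic placement keeps throughput usable even when the weights alone exceed VRAM.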

submitted by /u/a9udn9u