Llama.cpp quantization is broken

Reddit r/LocalLLaMA / 5/4/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The post argues that quantization quality in llama.cpp heavily impacts real-world model performance and stability, not just benchmark numbers.
  • It claims that standard low-bit quantizations (roughly Q1–Q4, including common variants like Q4_K_M) produce consistently worse outputs, including hallucinations, looping, and other “bugged” behaviors.
  • As an example, it compares GRM-2.6-Plus with an AutoRound-derived Q2_K_Mixed quant of Qwen3.6-27B, stating that the AutoRound quant performs better in practice despite being roughly the same size.
  • The author advocates adopting Intel AutoRound quantization as the default approach for lower quants, arguing that such calibration-based methods yield more consistent results.
  • They contend that typical quantization below Q5–Q6 is also inadequate for Qwen models unless a more intelligent quantization mechanism is used, and they ask for other methods that maintain consistent behavior.
Llama.cpp quantization is broken

The main reason: quantization quality directly affects a model's performance and stability, and that translates into real-world usefulness. Even though GRM-2.6-Plus beats the Qwen3.6 27B model it derives from on benchmarks, it gives worse results than an AutoRound Q2_K_Mixed quant of Qwen3.6 27B that is practically the same size.
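To see why low-bit quants lose quality in the first place, here is a toy sketch of uniform per-block quantization in NumPy. This is not llama.cpp's actual k-quant code (those formats use more elaborate scale/min encodings); it is only a minimal illustration, with made-up block size and random weights, of how reconstruction error grows as the bit width drops:

```python
import numpy as np

def block_quantize(x, bits, block_size=32):
    """Toy uniform per-block quantization: each block gets its own
    scale and offset, loosely mimicking block-wise GGUF quants."""
    levels = 2 ** bits - 1
    out = np.empty_like(x)
    for i in range(0, len(x), block_size):
        block = x[i:i + block_size]
        lo, hi = block.min(), block.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((block - lo) / scale)       # snap to integer grid
        out[i:i + block_size] = q * scale + lo   # dequantize back
    return out

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)

for bits in (2, 3, 4, 8):
    err = np.sqrt(np.mean((weights - block_quantize(weights, bits)) ** 2))
    print(f"{bits}-bit RMS error: {err:.4f}")
```

The RMS error shrinks sharply with each added bit, which is why a naive 2-bit quant degrades far more than an 8-bit one; calibration-based schemes like AutoRound try to claw back quality at those low bit widths by optimizing the rounding instead of snapping to the nearest grid point.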

This is just one example; most of the quants I tested suffer from the same problems, and only a few of them, mostly built with a different quantization mechanism, are useful below Q5.

I want to advocate for AutoRound quantization as the standard for lower quants (Q1–Q4). apex also performed quite well, but its size is larger. Maybe you know of other alternative methods that give consistent results, because standard quants like Q4_K_M don't produce adequate results and often lead to bugged behavior overall (looping, hallucinations, inconsistency).

Prompt: Create svg image of a pelican riding a bicycle

Multiple examples of different quant results

https://www.reddit.com/r/LocalLLaMA/comments/1szp96f/comment/oj3r4b1/

AutoRound Q2_K_Mixed: https://huggingface.co/sphaela/Qwen3.6-27B-AutoRound-GGUF

https://preview.redd.it/mn93lh9bz2zg1.png?width=875&format=png&auto=webp&s=fb39e93521c5f382c6438308e0f07fff21bb05d9

Regular llama.cpp Q4_K_M: https://huggingface.co/morikomorizz/GRM-2.6-Plus-GGUF

https://preview.redd.it/b0gigcm7z2zg1.png?width=700&format=png&auto=webp&s=aa826be7b07e2b4ef9a89bbea3443f992d3c41c3

This is just one example, and the output quality is consistently worse when I ask it tricky questions: it hallucinates more, loops, etc.

The community should understand that typical quantization below Q5–Q6 is inadequate for Qwen models unless you tinker with it through a more intelligent mechanism, like Intel AutoRound does.

From my experience, looping is a direct symptom of broken quantization; occasional syntactic errors in agentic coding are another.
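The looping symptom described above can be checked mechanically. Below is a small sketch (my own illustration, not anything from llama.cpp) that flags generations where the same word n-gram repeats many times, a crude proxy for degenerate looping; the n-gram size and repeat threshold are arbitrary choices:

```python
def has_loop(text, n=8, min_repeats=3):
    """Flag text in which some n-gram of words occurs at least
    min_repeats times -- a rough signal of degenerate looping."""
    words = text.split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return any(c >= min_repeats for c in counts.values())

sample = "The pelican rides the bicycle. " * 5
print(has_loop(sample))  # → True, the repeated sentence triggers the check
```

A check like this could be run over a batch of outputs from each quant to compare looping rates instead of eyeballing individual generations.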

submitted by /u/Ok-Importance-3529