I bought an RTX 5060 Ti 16GB around Christmas with one goal: get a strong model running locally on my card without paying API fees. I have been testing local AI with OpenClaw. I did not come into this with a quantization background; I only learned about llama.cpp, LM Studio, and Ollama two months ago. I just wanted something better than the usual Q3-class compromise (see my first post for benchmarks). Many times I thought about buying a 24GB card, but one look at the prices quickly turned me away. When the TurboQuant paper came out, and some results showed memory can be saved in the KV cache, I started wondering whether the same style of idea could help on weights, not just the KV cache. After many long nights after work (until 2am), that turned into a llama.cpp fork.
This work is inspired by the broader transform-based quantization line, especially RaBitQ-style Walsh–Hadamard rotation ideas and the recent TurboQuant result (Tom). What I wanted to test was whether that same geometry could help on weights, not just the KV cache.

Main Result on Qwen3.5-27B
On wiki.test.raw, TQ3_1S lands within ~0.0139 PPL of Q4_0. That is a gap of only ~0.19%.

Size

Q4_0: ~14.4GB → TQ3_1S: ~12.9GB, about 10% smaller.
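As a sanity check, the ~14.4GB (Q4_0) and ~12.9GB (TQ3_1S) file sizes reported in the post can be turned into rough bits-per-weight. The ~27e9 parameter count and decimal-GB sizes are assumptions for this back-of-envelope estimate:

```python
# Back-of-envelope bits-per-weight from the file sizes in the post.
# Assumptions: "27B" means ~27e9 parameters and sizes are decimal GB.
params = 27e9

q4_0_bpw   = 14.4e9 * 8 / params  # ~4.27 bits/weight
tq3_1s_bpw = 12.9e9 * 8 / params  # ~3.82 bits/weight

print(round(q4_0_bpw, 2), round(tq3_1s_bpw, 2))
print(f"size reduction: {1 - 12.9/14.4:.1%}")  # ~10.4%, i.e. "about 10% smaller"
```

The effective figure sitting above the nominal 3.5 bits is expected: per-block scales and any tensors kept at higher precision (e.g. embeddings) add overhead on top of the packed weights.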
The practical point for me is simple: with TQ3_1S, the full 27B model fits on my 16GB 5060 Ti; the Q4_0 variant does not.
I’m not claiming “better than Q4_0” in general. I’m claiming something narrower and, I think, useful: near-Q4_0 quality at about 10% smaller size.
Speed record during the perplexity test:

- generation tg10: 15.55 tok/s

Caveats
Links

I will open source the quantization steps once I have enough feedback and testing.

Update: since a few people said I only compare against Q4_0, here is an update: TQ3_4S will be published, with faster processing speed.
TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti
Reddit r/LocalLLaMA / 4/1/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- A local LLM user reports building a llama.cpp fork that introduces a TurboQuant-inspired 3.5-bit weight quantization format (TQ3_1S) with Walsh–Hadamard rotation, 8-centroid quantization, dual half-block scales, and CUDA runtime support.
- On Qwen3.5-27B (wiki.test.raw), TQ3_1S achieves near-Q4_0 quality with only ~0.0139 PPL (~0.19%) gap versus Q4_0, indicating the weight-quantization geometry can preserve performance.
- The new format reduces model size from ~14.4GB (Q4_0) to ~12.9GB (TQ3_1S), about 10% smaller while staying close in perplexity.
- The practical result is that the 27B model fully fits on a 16GB RTX 5060 Ti with TQ3_1S, whereas the Q4_0 variant does not fit under the same setup.
- The author positions this as a narrower, practical improvement (near-Q4_0 quality at smaller size) rather than a claim of universally better-than-Q4_0 quantization.
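The format described in the first bullet (8 centroids, dual half-block scales) can be illustrated with a minimal sketch. The block size of 32, the centroid table, and the 8-bit scale encoding are all assumptions, since the post does not publish the format; they only show how 3-bit centroid indices plus two small per-half-block scales land at 3.5 bits/weight.

```python
import numpy as np

# Hypothetical centroid table -- the real TQ3_1S codebook is not published.
CENTROIDS = np.array([-1.0, -0.6, -0.3, -0.1, 0.1, 0.3, 0.6, 1.0])

def quantize_block(w):
    """Quantize 32 weights: two half-blocks of 16, each with its own scale,
    every weight mapped to the nearest of 8 centroids (a 3-bit index)."""
    assert w.shape == (32,)
    idx = np.empty(32, dtype=np.uint8)
    scales = np.empty(2)
    for h in range(2):
        half = w[16 * h : 16 * (h + 1)]
        s = np.max(np.abs(half))
        scales[h] = s if s > 0 else 1.0
        # nearest-centroid assignment for each normalized weight
        idx[16 * h : 16 * (h + 1)] = np.argmin(
            np.abs(half[:, None] / scales[h] - CENTROIDS[None, :]), axis=1)
    return idx, scales

def dequantize_block(idx, scales):
    out = CENTROIDS[idx].copy()
    out[:16] *= scales[0]
    out[16:] *= scales[1]
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=32)
idx, scales = quantize_block(w)
w_hat = dequantize_block(idx, scales)
print(np.abs(w - w_hat).max())  # reconstruction error, bounded by the scales
```

Under these assumptions the storage cost works out exactly: 32 × 3 bits of indices plus two 8-bit scales is 112 bits per 32-weight block, i.e. 3.5 bits per weight.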




