TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti

Reddit r/LocalLLaMA / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A local LLM user reports building a llama.cpp fork that introduces a TurboQuant-inspired 3.5-bit weight quantization format (TQ3_1S) with Walsh–Hadamard rotation, 8-centroid quantization, dual half-block scales, and CUDA runtime support.
  • On Qwen3.5-27B (wiki.test.raw), TQ3_1S achieves near-Q4_0 quality with only ~0.0139 PPL (~0.19%) gap versus Q4_0, indicating the weight-quantization geometry can preserve performance.
  • The new format reduces model size from ~14.4GB (Q4_0) to ~12.9GB (TQ3_1S), about 10% smaller while staying close in perplexity.
  • The practical result is that the 27B model fully fits on a 16GB RTX 5060 Ti with TQ3_1S, whereas the Q4_0 variant does not fit under the same setup.
  • The author positions this as a narrower, practical improvement (near-Q4_0 quality at smaller size) rather than a claim of universally better-than-Q4_0 quantization.

I bought an RTX 5060 Ti 16GB around Christmas and had one goal: get a strong model running locally on my card without paying API fees. I have been testing local AI with OpenClaw.

I did not come into this with a quantization background. I only learned about llama.cpp, LM Studio, and Ollama two months ago.

I just wanted something better than the usual Q3-class compromise (see my first post for benchmarks). I was often tempted to buy a 24GB card, but one look at the prices quickly turned me away.

When the TurboQuant paper came out, and early results showed memory could be saved on the KV cache, I started wondering whether the same style of idea could help on weights, not just the KV cache.

P.S. I nearly had the KV quantization done with CUDA support, but someone beat me to it.

After many long nights after work (often until 2am), that turned into a llama.cpp fork with a 3.5-bit weight format I’m calling TQ3_1S:

  • Walsh-Hadamard rotation
  • 8-centroid quantization
  • dual half-block scales
  • CUDA runtime support in llama.cpp

This work is inspired by the broader transform-based quantization line, especially RaBitQ-style Walsh-Hadamard rotation ideas and the recent TurboQuant result (Tom). The thing I wanted to test was whether that same geometry could help on weights, not just the KV cache.
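Since the format details aren't published yet, here is a toy NumPy sketch of the general recipe those bullet points describe, just to make the idea concrete. The block size (32), the uniform 8-entry centroid table, and the scale handling are illustrative guesses on my part, not the actual TQ3_1S spec:

```python
import numpy as np

def hadamard(n):
    # Walsh-Hadamard matrix via Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H.T undoes the rotation

def quantize_block(w, centroids):
    # Rotate one block of weights, then quantize each half-block against a
    # shared 8-entry centroid table, with a separate scale per half.
    r = hadamard(len(w)) @ w              # rotation spreads outliers across the block
    idx, scales = [], []
    for half in np.split(r, 2):           # dual half-block scales
        s = max(np.max(np.abs(half)), 1e-12)
        # nearest of 8 centroids -> a 3-bit index per weight
        q = np.argmin(np.abs(half[:, None] / s - centroids[None, :]), axis=1)
        idx.append(q)
        scales.append(s)
    return np.concatenate(idx), np.array(scales)

def dequantize_block(idx, scales, centroids, n=32):
    r = np.concatenate([centroids[q] * s
                        for q, s in zip(np.split(idx, 2), scales)])
    return hadamard(n).T @ r              # invert the rotation

centroids = np.linspace(-1.0, 1.0, 8)     # assumed uniform centroid table
w = np.random.randn(32)
idx, scales = quantize_block(w, centroids)
w_hat = dequantize_block(idx, scales, centroids)
print("reconstruction RMSE:", np.sqrt(np.mean((w - w_hat) ** 2)))
```

(For what it's worth, 3-bit indices plus two fp16 half-block scales over a 32-weight block would come out to (32×3 + 2×16)/32 = 4.0 bpw, which lines up with the bpw column in the update table below.)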

Main Result on Qwen3.5-27B

  • Q4_0: 7.2431 +/- 0.04822
  • TQ3_1S: 7.2570 +/- 0.04802

That is a gap of only +0.0139 PPL, about 0.19%, on the full wiki.test.raw pass (580 chunks, c=512).
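The percentage is just the gap relative to the Q4_0 baseline:

```python
# Relative PPL gap, using the numbers above.
ppl_q4_0, ppl_tq3_1s = 7.2431, 7.2570
gap = ppl_tq3_1s - ppl_q4_0
print(f"{gap:.4f} PPL = {100 * gap / ppl_q4_0:.2f}% of Q4_0")  # 0.0139 PPL = 0.19%
```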

Size

  • Q4_0: about 14.4 GB
  • TQ3_1S: about 12.9 GB

So TQ3_1S is about 10% smaller while staying near Q4_0 quality.

The practical point for me is simple:

  • TQ3_1S fits fully on my 16GB RTX 5060 Ti
  • Q4_0 does not fit fully on GPU in the same setup
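Here is the back-of-the-envelope fit check behind those two bullets. The ~2 GB I budget for KV cache, compute buffers, and CUDA context is an assumed overhead for my setup, not a measured number:

```python
# Rough VRAM fit check on a 16 GB card; OVERHEAD_GB is an assumption.
VRAM_GB, OVERHEAD_GB = 16.0, 2.0
for name, weights_gb in [("Q4_0", 14.4), ("TQ3_1S", 12.9)]:
    need = weights_gb + OVERHEAD_GB
    verdict = "fits" if need <= VRAM_GB else "spills to CPU"
    print(f"{name}: ~{need:.1f} GB needed -> {verdict}")
```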

So I’m not claiming “better than Q4_0” in general. I’m claiming something narrower and, I think, useful:

  • near-Q4_0 quality
  • materially smaller than Q4_0
  • enough to make a 27B model practical on a 16GB card

Speed recorded during the perplexity test:

  • prompt processing (pp512): 130.87 tok/s
  • generation (tg10): 15.55 tok/s

Caveats

  • this is the strongest result on the 27B model I tested, not a blanket claim that plain TQ3 works equally well at every model size
  • I am pretty new to this, so I may be missing a lot of tests. I only have one card to test on :-)
  • Be skeptical, as I can hardly believe I'm publishing my own quant myself
  • the speed story here is mainly a deployment/fit win on this GPU class, not a blanket claim that native TQ3 kernels are always faster than native Q4_0

Links

I will open source the quantization steps once I have enough feedback and testing.

Update: Since a few people said I only compare against Q4_0, here is an update. TQ3_4S will be published with faster processing speed.

Format       bpw    PPL (c=2048)   Size
TQ3_4S       4.00   6.7727         12.9 GB
Q3_K_S       3.44   6.7970         11.4 GB
IQ4_XS       4.25   6.8334         13.9 GB
TQ3_1S       4.00   6.9186         12.9 GB
UD-Q2_K_XL   3.30   7.5294         11.0 GB
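One way to read that table is as a size/quality Pareto check; for example, TQ3_4S beats IQ4_XS on both PPL and size. A quick script over the rows above (numbers copied verbatim):

```python
# Which formats are Pareto-optimal on (size, PPL)? Rows from the table above.
rows = [("TQ3_4S", 4.00, 6.7727, 12.9), ("Q3_K_S", 3.44, 6.7970, 11.4),
        ("IQ4_XS", 4.25, 6.8334, 13.9), ("TQ3_1S", 4.00, 6.9186, 12.9),
        ("UD-Q2_K_XL", 3.30, 7.5294, 11.0)]
for name, bpw, ppl, gb in rows:
    dominated = any(g <= gb and p <= ppl and (g, p) != (gb, ppl)
                    for _, _, p, g in rows)
    print(f"{name:<12} {'dominated' if dominated else 'Pareto-optimal'}")
```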

- u/Imaginary-Anywhere23
