Qwen 3.6 q8 at 50t/s or q4 at 112 t/s?

Reddit r/LocalLLaMA / 4/18/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The post asks how to decide between Qwen 3.6 quantized models running at different speeds (Q8 at 50 t/s versus Q4 at 112 t/s) for use in a local inference harness like pi.
  • The author reports that Q4 was extremely consistent and reliable in their testing, including running with a 131k context window and surviving two compacting steps on a well-defined task without breaking behavior.
  • They plan to test Q8 next and want others’ impressions on the expected qualitative differences between Q8 and Q4 in practice.
  • Overall, the discussion focuses on performance tradeoffs between higher-precision (Q8) and higher-throughput (Q4) settings for long-context, robustness-sensitive runs.
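The throughput side of the tradeoff is easy to quantify up front; the quality difference between Q8 and Q4 is the part only testing reveals. A back-of-the-envelope sketch (throughput figures are from the post; the output-token budget is a hypothetical example):

```python
# Rough generation-time comparison for the two quants.
# Throughputs (t/s) are as reported in the post; the token budget is made up.
SPEEDS_TPS = {"q8": 50, "q4": 112}
output_tokens = 20_000  # hypothetical total output for a long agentic run

for quant, tps in SPEEDS_TPS.items():
    minutes = output_tokens / tps / 60
    print(f"{quant}: ~{minutes:.1f} min for {output_tokens} tokens")

# Q4 finishes 112/50 = 2.24x faster. Whether Q8's extra precision repays
# the wait depends on task failure/retry rates, which only testing shows.
```

Note that a single failed run that must be redone can erase the speed advantage, which is why the reliability the author observed at Q4 matters as much as the raw t/s numbers.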

How would you go about choosing between the two for use in a harness like pi?

Did a good bit with q4 yesterday and it was so consistent and reliable. I had it set to 131k context and it worked through 2 compactions on a clearly defined task without messing the whole thing up. Very excited about this recent step forward.

I'm going to start working with the q8 some today, but I was interested in your impressions of the kinds of differences I might expect between the two.

submitted by /u/GotHereLateNameTaken