Consider running a bigger quant if possible

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author suggests that if hardware allows, running a larger quantized model can significantly improve real-world behavior compared with smaller quants.
  • They report that Qwen 3.6 IQ4_XS at 128k context underperformed badly due to looping, formatting mistakes, and incorrect implementations.
  • After switching to the unsloth IQ4_NL_XL (with some VRAM headroom), they found it worked much better for agentic coding tasks.
  • They advise not to judge models purely by tok/s or VRAM fit; instead, measure end-to-end task time and expect that a slightly slower model that finishes correctly can be faster overall.

Just a little reminder that *if* it is possible for you to run bigger quants, do it. I ran Qwen 3.6 IQ4_XS at 128k context and was very disappointed: it would loop, make formatting errors, implement the wrong things, etc. I had a little bit of VRAM headroom and decided to give the new unsloth IQ4_NL_XL a try, and what can I say, it works MUCH better for agentic coding. If you are like me and start conservative with your model selection based on what completely fits into VRAM, it might worsen your experience to a large degree. Always look at how long a task really takes end to end and ignore tok/s for quant comparisons. You get stuff done faster if the model with slower tok/s (even with offload) takes less time to complete queries correctly (duh).
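The tok/s-vs-wall-clock point can be made concrete with a little arithmetic. Below is a minimal sketch (the numbers are purely illustrative, not benchmarks from the post): a smaller quant that decodes faster but loops and needs retries can lose to a slower, bigger quant that gets the task right on the first attempt.

```python
# Illustrative sketch: judge quants by end-to-end task time, not raw tok/s.
# All figures here are made-up assumptions for demonstration.

def task_time(tok_per_s: float, tokens_per_attempt: int, attempts: int) -> float:
    """Wall-clock seconds until a correct result, assuming each failed
    attempt burns a full generation before you retry."""
    return attempts * tokens_per_attempt / tok_per_s

# Smaller quant: faster decoding, but loops/misformats and needs retries.
small_quant = task_time(tok_per_s=40.0, tokens_per_attempt=2000, attempts=4)

# Bigger quant (maybe partially offloaded): slower, but correct first try.
big_quant = task_time(tok_per_s=25.0, tokens_per_attempt=2000, attempts=1)

print(f"small quant: {small_quant:.0f}s, big quant: {big_quant:.0f}s")
# → small quant: 200s, big quant: 80s
```

Under these assumed numbers, the "slower" 25 tok/s model finishes the task in 80 s while the 40 tok/s model needs 200 s, which is exactly the author's point about measuring how long the task really takes.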

submitted by /u/Flashy_Management962