AI Navigate

(Sharing Experience) Qwen3.5-122B-A10B does not quantize well after Q4

Reddit r/LocalLLaMA / 3/16/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The author tested Qwen3.5-122B-A10B on 48GB VRAM hoping to replicate Qwen3.5 27B's performance with 2-3x faster inference and lower context memory usage.
  • They report that Q4+ on 122B performed well, but heavy CPU offload prevented beating 27B's TG speeds and significantly lagged in PP speeds.
  • They tried Q3_K_M with some CPU offload, and UD_Q2_K_XL to fit 100% in VRAM, since this level of quantization had worked for them on other >100B models in the past.
  • In practice the heavier quantization delivered the hoped-for speeds but degraded quality so sharply versus Q4 that the model was unusable on their codebases.
  • The author shares this experience to help others gauge whether heavy quantization is worth trying before they invest time and effort.

Just a report of my own experiences:

I've got 48GB of VRAM. I was excited that Qwen3.5-122B-A10B looked like a way to get Qwen3.5 27B's performance at 2-3x the inference speed with much lower memory needs for context. I had great experiences with Q4+ on 122B, but the heavy CPU offload meant I rarely beat 27B's TG speeds and significantly fell behind in PP speeds.

I tried Q3_K_M with some CPU offload and UD_Q2_K_XL for 100% in-VRAM. With models > 100B total params I've had success in the past with this level of quantization so I figured it was worth a shot.
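As a back-of-envelope sketch of why these quant levels matter on 48GB: weight footprint is roughly params × bits-per-weight / 8. The bpw figures and the 10% runtime-overhead factor below are rough assumptions (real GGUF files mix tensor types), not measured values:

```python
# Rough VRAM estimate for a 122B-parameter model at different GGUF quants.
# The bpw values are approximations; the 1.10 overhead factor for KV cache
# and buffers is an assumed fudge factor, not a measurement.
PARAMS = 122e9

BPW = {
    "Q4_K_M": 4.8,        # approximate effective bits per weight
    "Q3_K_M": 3.9,
    "UD_Q2_K_XL": 2.7,    # dynamic quant; bpw varies per tensor
}

def est_gib(quant: str, overhead: float = 1.10) -> float:
    """Estimated footprint in GiB: weights plus a rough runtime overhead."""
    return PARAMS * BPW[quant] / 8 / 2**30 * overhead

for q in BPW:
    verdict = "fits" if est_gib(q) <= 48 else "needs CPU offload"
    print(f"{q:>12}: ~{est_gib(q):5.1f} GiB -> {verdict} on 48 GB")
```

Under these assumptions, Q4 lands around 75 GiB (heavy offload), Q3_K_M around 61 GiB (partial offload), and the ~2.7 bpw quant around 42 GiB, which is why only the last one runs fully in 48GB of VRAM.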

Nope.

The speeds I was hoping for were there (woohoo!) but it consistently destroys my codebases. It's smart enough to play well with the tool-calls and write syntactically-correct code but cannot make decisions to save its life. It is an absolute cliff-dive in performance vs Q4.

Just figured I'd share, since every time I explore heavily quantized larger models I always search first to see if others have tried it.

submitted by /u/EmPips