Qwen3.5-397B is shockingly useful at Q2

Reddit r/LocalLLaMA / 4/7/2026


Key Points

  • The post reports that Qwen3.5-397B quantized to Q2 (UD_IQ2_M weights, ~122GB on disk) has become surprisingly usable for local inference, contradicting prior experiences where Q2 levels were largely unreliable.
  • Using a LocalLLaMA-friendly workstation (Ryzen 9 3950X, 96GB DDR4, two GPUs with 48GB of VRAM combined, llama.cpp with ROCm), the author observes ~11 tokens/sec generation and up to ~120 tokens/sec prompt processing after warmup.
  • For output quality, the model is said to perform strongly in coding and knowledge/trivia-style tasks, beating several larger or differently quantized models in the author’s tests.
  • The author notes limitations: hallucinations can still occur, and running without “reasoning budget” reduces the model’s ability to self-correct, making reasoning tokens advisable.
  • The takeaway is a practical recommendation to try Qwen3.5-397B at Q2, because it appears to be the best model the author’s system can run and may be broadly helpful to others using similar local setups.

Quick specs (this is a workstation that was morphed into something LocalLLaMA-friendly over time):

  • Ryzen 9 3950X

  • 96GB DDR4 (dual channel, running at 3000 MHz)

  • W6800 + RX 6800 (48GB of VRAM total at ~512GB/s)

  • most tests done with ~20k context; kv-cache at q8_0

  • llama.cpp main branch with ROCm
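
For anyone trying a similar setup, here's a sketch of what a launch command could look like. The flags (`-c`, `-ngl`, `--cache-type-k/v`, `--tensor-split`) are real llama.cpp options, but the model filename, offload count, and split ratio below are my guesses, not the author's actual command:

```shell
# Hypothetical llama.cpp launch for this setup; the model filename, -ngl
# value, and --tensor-split ratio are assumptions -- tune for your own rig.
# -c 20480: ~20k context, matching the tests above
# --cache-type-k/v q8_0: quantized kv-cache, as in the post
# -ngl: partial offload, since the ~122GB model far exceeds 48GB of VRAM
./llama-server \
  -m Qwen3.5-397B-UD_IQ2_M.gguf \
  -c 20480 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ngl 18 \
  --tensor-split 32,16
```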

The model used was the UD_IQ2_M weights from Unsloth, which are ~122GB on disk. I haven't had success with Q2 levels of quantization since Qwen3-235B, so I was assuming this test would be a throwaway like all of my recent tests, but it turns out it's REALLY good and somewhat usable.
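Worth noting that "Q2" here is closer to ~2.5 bits per weight on average, which may be part of why it holds up. A rough back-of-envelope from the numbers in the post (the file also holds embeddings and some higher-precision tensors, so this is only an average, not a per-tensor figure):

```python
# Rough average bits/weight for the UD_IQ2_M file: ~122GB on disk
# for a 397B-parameter model. Illustrative estimate only -- GGUF files
# mix quantization types across tensors.
params = 397e9       # parameters
disk_bytes = 122e9   # ~122GB on disk

bits_per_weight = disk_bytes * 8 / params
print(f"~{bits_per_weight:.2f} bits/weight")  # ~2.46
```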

For performance: after allowing it to warm up (2-3 minutes of token gen), I'm getting:

  • ~11 tokens/second token-gen

  • ~43 tokens/second prompt-processing for shorter prompts and about 120 t/s for longer prompts (I did not record PP speeds on very long agentic workflows to see what caching benefits might look like)
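
To put those numbers in perspective, here's the arithmetic on what they mean for a full ~20k-token prompt and a long response (the prompt and response sizes are my illustrative picks, not measurements from the post):

```python
# Back-of-envelope timing from the post's numbers: ~120 tok/s prompt
# processing on longer prompts, ~11 tok/s generation after warmup.
PP_TOKS_PER_S = 120
GEN_TOKS_PER_S = 11

prompt_tokens = 20_000  # roughly a full context window (assumption)
gen_tokens = 1_000      # a long-ish response (assumption)

pp_seconds = prompt_tokens / PP_TOKS_PER_S
gen_seconds = gen_tokens / GEN_TOKS_PER_S

print(f"prompt processing: ~{pp_seconds:.0f} s")  # ~167 s
print(f"generation:        ~{gen_seconds:.0f} s")  # ~91 s
```

So a cold 20k-token prompt costs close to three minutes of prompt processing, which matches the author's point that it's under the bar for interactive use but fine for unattended agent loops.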

That prompt-processing speed is a bit under the bar for interactive coding sessions, but for the 24/7 agent loops I run, it can get a lot done.

For output quality: it codes incredibly well and is beating Qwen3.5 27B (full), Qwen3.5 122B (Q4), MiniMax M2.5 (Q4), GPT-OSS-120B (full), and Gemma 4 31B (full) in coding and knowledge tasks (I keep a long set of trivia questions whose answers can have different levels of correctness). I can catch hallucinations in the reasoning output (I don't think any Q2 is immune to this), but it quickly steers itself back on course. I had some fun using it without a reasoning budget as well, but then it cannot correct any hallucinations, so I wouldn't advise using it without reasoning tokens.
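The author's trivia set isn't shared, but "different levels of correctness" suggests partial-credit grading rather than exact match. A minimal sketch of what such a rubric could look like, with every question, keyword, and credit value here being hypothetical:

```python
# Hypothetical partial-credit grader for trivia answers that can be fully
# correct, partially correct, or wrong. The rubric below is illustrative;
# the author's actual question set and scoring are not public.
def grade(answer: str, rubric: dict) -> float:
    """Return 1.0 for a full match, the listed partial credit for a
    partial match, 0.0 otherwise. Matching is naive substring search."""
    text = answer.lower()
    if any(k in text for k in rubric["full"]):
        return 1.0
    for keywords, credit in rubric.get("partial", []):
        if any(k in text for k in keywords):
            return credit
    return 0.0

rubric = {
    "full": ["1969"],                           # exact year -> full credit
    "partial": [(["late 1960s", "60s"], 0.5)],  # vague decade -> half credit
}

print(grade("The Moon landing was in 1969.", rubric))  # 1.0
print(grade("Sometime in the late 1960s.", rubric))    # 0.5
print(grade("No idea.", rubric))                       # 0.0
```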

The point of this post: basically everything Q2 and under has been unusable for me for the last several months. I wanted to point a few people towards Qwen3.5-397B and recommend giving it a chance. It's suddenly the strongest model my system can run, and it might be good for you too.

submitted by /u/EmPips