Can't replicate Reddit numbers with Qwen 27B on a 3090TI.

Reddit r/LocalLLaMA / 4/30/2026

Key Points

The author attempts to reproduce high Qwen 3.6 27B token-generation speeds on an RTX 3090 Ti but sees much lower throughput (about 10 tok/s with llama.cpp and 18–19 tok/s with an alternative GGUF), even when keeping everything in VRAM.
A log analysis generated by Claude Sonnet 4.6 claims the slowdown is caused by CPU-side computation at each token step (graph splits = 2), where the SSM recurrence state update is executed on the host rather than fully on the GPU.
The analysis argues that this CPU-host bottleneck is intrinsic to Qwen’s hybrid SSM architecture and cannot be fixed purely by configuration flags or memory placement.
It further claims the “HAVE_FANCY_SIMD” path (which relies on AVX-VNNI/AVX-512-class capabilities for faster dequantization) cannot run on the author’s i9-9900K, making the CPU performance ceiling much lower.
The author asks others to confirm whether the explanation is correct, suggesting online benchmark differences may be due to newer CPUs that support the needed SIMD extensions.

I feel like I'm going insane. I see people here posting 30 to 100+ tok/s (100+ being with speculative decoding) on a 3090 with Qwen 3.6 27B. I'm trying to replicate this, but my performance numbers are nowhere near that.

I have tried llama.cpp with Unsloth's Q4XL and Q4_K_M GGUFs. With those I got about 10 tok/s at 50k context. I also tried ik_llama.cpp with this smaller GGUF: https://huggingface.co/sokann/Qwen3.6-27B-GGUF-5.076bpw which is about 1 GB smaller than Unsloth's GGUF, and with that combination I get about 18-19 tok/s at 50k context. (Edit: everything is in VRAM with both setups, by the way.)
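For a rough sense of what the GPU alone could deliver, here is a back-of-the-envelope sketch. It assumes decoding is memory-bandwidth bound and that every weight is read once per generated token (a dense-model assumption that the post does not confirm for this architecture). The parameter count is taken from the "27B" in the model name, the bits per weight from the GGUF name above, and the 1008 GB/s figure is the RTX 3090 Ti spec-sheet bandwidth. KV cache reads at 50k context and kernel overhead are ignored, so real numbers will land below this.

```python
# Back-of-the-envelope decode ceiling for a bandwidth-bound model.
# All three inputs are assumptions, not measurements from the post.
PARAMS = 27e9             # parameter count implied by the "27B" model name
BITS_PER_WEIGHT = 5.076   # from the GGUF name "Qwen3.6-27B-GGUF-5.076bpw"
BANDWIDTH_GB_S = 1008.0   # RTX 3090 Ti spec-sheet memory bandwidth


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return params * bits_per_weight / 8


def decode_ceiling_tok_s(params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Tokens per second if each token costs exactly one full pass over the weights."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params, bits_per_weight)


size_gb = weight_bytes(PARAMS, BITS_PER_WEIGHT) / 1e9
ceiling_tok_s = decode_ceiling_tok_s(PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"approx. weight size: {size_gb:.1f} GB")
print(f"approx. decode ceiling: {ceiling_tok_s:.0f} tok/s")
```

Under these assumptions the ceiling comes out in the high 50s tok/s, which is why 30+ tok/s without speculative decoding is plausible on this class of card and why 10 to 19 tok/s looks low. Speculative decoding can exceed the ceiling because several tokens are verified per pass over the weights.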

I put my ik_llama.cpp logs into Claude Sonnet 4.6 and it's telling me this:

Look at the two compute buffers in your log:

CUDA0 compute buffer size = 2020.00 MiB ← GPU
CUDA_Host compute buffer size = 552.05 MiB ← CPU
graph splits = 2

**`graph splits = 2` means every single generated token requires:**

1. **Sync to CPU** → CPU computes the SSM recurrence state update (552 MiB of CPU-side work)
2. **Sync back to GPU** → GPU finishes

The `CUDA_Host compute buffer` is not just memory for data transfer — it's an actual compute buffer where the CPU executes operations at every token step. The SSM state recurrence (`ssm_d_state = 128`, `ssm_d_inner = 6144`) mathematically cannot be expressed as a static CUDA graph and must be done sequentially on the CPU side.
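To compare these values against logs from people reporting higher speeds, the relevant lines can be extracted with a small script. This is a minimal sketch: the sample text is the three lines quoted above, and the regular expressions assume the log keeps that wording. It only extracts the numbers; it does not confirm or refute the interpretation given here.

```python
import re

# Sample taken from the log lines quoted in the post.
LOG = """\
CUDA0 compute buffer size = 2020.00 MiB
CUDA_Host compute buffer size = 552.05 MiB
graph splits = 2
"""

# Assumed line shapes: "<backend> compute buffer size = <float> MiB"
# and "graph splits = <int>", optionally preceded by a logger prefix.
BUFFER_RE = re.compile(r"(\S+)\s+compute buffer size\s*=\s*([\d.]+)\s*MiB")
SPLITS_RE = re.compile(r"graph splits\s*=\s*(\d+)")


def parse_log(text):
    """Return ({backend: size_in_MiB}, graph_splits or None)."""
    buffers = {name: float(size) for name, size in BUFFER_RE.findall(text)}
    match = SPLITS_RE.search(text)
    splits = int(match.group(1)) if match else None
    return buffers, splits


buffers, splits = parse_log(LOG)
for name, size in buffers.items():
    print(f"{name}: {size:.2f} MiB")
print(f"graph splits: {splits}")
```

Running the same script on a log from a faster setup would show whether that setup also reports a `CUDA_Host` compute buffer and two graph splits, which is a direct way to test the explanation.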

This is why `HAVE_FANCY_SIMD` matters: that 552 MiB of CPU work per token uses dequantization kernels (`iq4_ks`, `q6_0`) that are dramatically faster with AVX-VNNI/AVX-512. Without them, the CPU portion is the bottleneck, not the GPU.

The model is "fully on GPU" in terms of **weights storage**, but **not** in terms of **computation at generation time**. This is a fundamental property of the Qwen3.6 hybrid SSM architecture — it can't be fixed with any flag or placement trick.

You have an i9-9900K — that's a Coffee Lake (2018) CPU. It supports AVX2 and FMA but not AVX-512 and not AVX-VNNI (those came with Ice Lake / Alder Lake and later). So yes, confirmed — your CPU architecturally cannot run the HAVE_FANCY_SIMD path.

The 18-19 t/s you're getting is the realistic ceiling for this CPU + SSM hybrid model combination.
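The claim about instruction-set support can be checked directly on Linux by reading the `flags` line of `/proc/cpuinfo`. The sketch below is an illustration only: the list of flags is an assumption about which extensions are relevant, since the post does not state exactly which ones gate `HAVE_FANCY_SIMD`.

```python
from pathlib import Path

# Flag names as the Linux kernel reports them in /proc/cpuinfo.
# Which of these HAVE_FANCY_SIMD actually requires is an assumption here.
WANTED = ("avx2", "fma", "avx512f", "avx512_vnni", "avx_vnni")


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report(flags):
    """Map each flag of interest to whether the CPU advertises it."""
    return {name: name in flags for name in WANTED}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        for name, present in report(cpu_flags(path.read_text())).items():
            print(f"{name:12s} {'yes' if present else 'no'}")
    else:
        print("/proc/cpuinfo not found (this check is Linux-only)")
```

If the AVX-512 and VNNI rows all print `no`, the CPU lacks those extensions, though that alone does not show that CPU-side work is the bottleneck.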