I feel like I'm going insane. I see people here posting 30 to 100+ tok/s (the 100+ figures being with speculative decoding) on a 3090 with Qwen 3.6 27B. I'm trying to replicate this, but my performance numbers are nowhere near that.
I have tried llama.cpp with Unsloth's Q4XL and Q4_K_M GGUFs. With those I got about 10 tok/s at 50k context. I also tried ik_llama.cpp with this smaller GGUF: https://huggingface.co/sokann/Qwen3.6-27B-GGUF-5.076bpw which is about 1GB smaller than Unsloth's GGUF, and with that combination I get about 18-19 tok/s at 50k context. (Edit: everything is in VRAM with both setups, by the way.)
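As a rough sanity check on what a 3090 could do here, this is a back-of-the-envelope sketch of the decode ceiling if generation is purely memory-bandwidth bound. Everything in it is an assumption rather than a measurement: roughly 27e9 weights, 5.076 bits per weight (taken from the GGUF name), and the 3090's spec-sheet bandwidth of 936 GB/s. The function names are made up for illustration. It ignores KV cache reads at 50k context and any compute overhead, so real numbers should land below it.

```python
# Back-of-the-envelope decode ceiling for a memory-bandwidth-bound GPU.
# Assumptions (not measured): ~27e9 weights, 5.076 bits per weight as in
# the GGUF name, and the RTX 3090's 936 GB/s spec-sheet bandwidth.

def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if each token streams all weights once."""
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    size = weights_gb(27e9, 5.076)
    ceiling = decode_ceiling_tok_s(936.0, size)
    print(f"weights ~ {size:.1f} GB, ceiling ~ {ceiling:.0f} tok/s")
```

Under those assumptions the weights come to roughly 17 GB and the ceiling sits in the mid-50s tok/s without speculative decoding. Speculative decoding can go past that because several tokens are verified per pass over the weights.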
I put my ik_llama.cpp logs into Claude Sonnet 4.6 and it's telling me this:
Look at the two compute buffers in your log:
CUDA0 compute buffer size = 2020.00 MiB ← GPU
CUDA_Host compute buffer size = 552.05 MiB ← CPU
graph splits = 2
**`graph splits = 2` means every single generated token requires:**
**Sync to CPU** → CPU computes the SSM recurrence state update (552 MiB of CPU-side work)
Sync back to GPU → GPU finishes
The `CUDA_Host compute buffer` is not just memory for data transfer — it's an actual compute buffer where the CPU executes operations at every token step. The SSM state recurrence (`ssm_d_state = 128`, `ssm_d_inner = 6144`) mathematically cannot be expressed as a static CUDA graph and must be done sequentially on the CPU side.
This is why `HAVE_FANCY_SIMD` matters: that 552 MiB of CPU work per token uses dequantization kernels (`iq4_ks`, `q6_0`) that are dramatically faster with AVX-VNNI/AVX-512. Without them, the CPU portion is the bottleneck, not the GPU.
The model is "fully on GPU" in terms of **weights storage**, but **not** in terms of **computation at generation time**. This is a fundamental property of the Qwen3.6 hybrid SSM architecture — it can't be fixed with any flag or placement trick.
You have an i9-9900K — that's a Coffee Lake (2018) CPU. It supports AVX2 and FMA but not AVX-512 and not AVX-VNNI (those came with Ice Lake / Alder Lake and later). So yes, confirmed — your CPU architecturally cannot run the HAVE_FANCY_SIMD path.
The 18-19 t/s you're getting is the realistic ceiling for this CPU + SSM hybrid model combination.
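For anyone who wants to compare their own logs against the lines picked out above, here is a minimal Python sketch that extracts the compute buffer sizes and the graph split count. The sample text is just the three log lines quoted in this post. The regexes are a guess at the format based on that excerpt, so real logs with different wording or extra fields may need adjusting. `parse_log` is a made-up helper name.

```python
import re

# Sample lines copied from the log excerpt in this post.
LOG = """\
CUDA0 compute buffer size = 2020.00 MiB
CUDA_Host compute buffer size = 552.05 MiB
graph splits = 2
"""

# Assumed format: "<name> compute buffer size = <number> MiB".
BUFFER_RE = re.compile(r"(\S+)\s+compute buffer size\s*=\s*([\d.]+)\s*MiB")
SPLITS_RE = re.compile(r"graph splits\s*=\s*(\d+)")


def parse_log(text: str) -> dict:
    """Return buffer sizes in MiB keyed by backend name, plus the split count."""
    buffers = {name: float(size) for name, size in BUFFER_RE.findall(text)}
    match = SPLITS_RE.search(text)
    return {
        "buffers_mib": buffers,
        "graph_splits": int(match.group(1)) if match else None,
    }


if __name__ == "__main__":
    print(parse_log(LOG))
```

This only reports what the log says. It does not show where any given operation actually ran, so on its own it can't confirm or refute the explanation above.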
Can someone confirm whether this is accurate, or is it gaslighting me? Are all the numbers I see online higher because those people are using newer CPUs?
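The CPU half of the claim is at least easy to check directly. Here is a small sketch, assuming Linux, that reads `/proc/cpuinfo` and reports whether the instruction sets mentioned above (AVX2, FMA, AVX-512, AVX-VNNI) show up in the flags. The flag names are the ones the Linux kernel uses; `cpu_flags` and `report` are hypothetical helper names. It only shows what the CPU supports, not which code path ik_llama.cpp ends up taking or whether the CPU is the actual bottleneck.

```python
from pathlib import Path

# Linux /proc/cpuinfo flag names for the instruction sets discussed above.
FLAGS_OF_INTEREST = ("avx2", "fma", "avx512f", "avx512_vnni", "avx_vnni")


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flag set from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def report(cpuinfo_text: str) -> dict[str, bool]:
    """Map each flag of interest to whether the CPU advertises it."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        for name, present in report(path.read_text()).items():
            print(f"{name:12s} {'yes' if present else 'no'}")
    else:
        print("No /proc/cpuinfo here; this sketch only covers Linux.")
```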