vLLM Just Merged TurboQuant Fix for Qwen 3.5+

Reddit r/LocalLLaMA / 5/5/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • vLLM has merged a “TurboQuant” fix aimed at resolving a prior “Not Implemented” error related to Mamba layers when running Qwen 3.5+.
  • Initial testing indicates the fix works with Qwen 3.6 as well (tested on the 27B model).
  • Users can enable the feature by passing `--kv-cache-dtype turboquant_4bit_nc`, with several other TurboQuant KV-cache dtype options available.
  • If running with `--enable-chunked-prefill`, a Mamba alignment warning can be addressed by raising the batched-token limit above the value the warning reports (e.g., setting `--max-num-batched-tokens 4096`).

Previously it was throwing a 'Not Implemented' error due to Mamba layers. Going to test it now!

https://github.com/vllm-project/vllm/pull/39931

Edit: Works with Qwen 3.6, tested with 27B
It can be enabled with this argument:

--kv-cache-dtype turboquant_4bit_nc 
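
For reference, a full launch might look like the sketch below. The model id is a placeholder, not a real checkpoint name; substitute whatever Qwen 3.6 model you're serving.

```
# Minimal sketch: serve a model with the TurboQuant KV-cache dtype.
# <your-qwen-3.6-model> is a placeholder, not a real model id.
vllm serve <your-qwen-3.6-model> \
  --kv-cache-dtype turboquant_4bit_nc
```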

Other available options:

  • turboquant_k8v4
  • turboquant_4bit_nc
  • turboquant_k3v4_nc
  • turboquant_3bit_nc

When running with --enable-chunked-prefill it complained about Mamba alignment; you just need to set the batched-token limit higher than the value the error reports. I used 4096 to fix it: --max-num-batched-tokens 4096
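
Putting the pieces together, a combined launch might look like this sketch (same placeholder model id as above; 4096 is just the value that worked here, and anything above the value the warning reports should do):

```
# Sketch: TurboQuant KV cache plus chunked prefill, with the
# batched-token limit raised to clear the Mamba alignment warning.
vllm serve <your-qwen-3.6-model> \
  --kv-cache-dtype turboquant_4bit_nc \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096
```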

submitted by /u/havenoammo