Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower?

Reddit r/LocalLLaMA / 4/18/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A Reddit user reports that Unsloth-produced GGUFs for Qwen3.6-35B run noticeably slower on a CPU-only Debian 13 setup with llama.cpp, yielding about 30% fewer tokens per second than similar models from other creators.
  • They observe longer latency for follow-up responses with the Unsloth GGUFs (e.g., ~25–29 seconds vs ~14–20 seconds in the reported comparisons), suggesting performance differences beyond just initial tokens.
  • The comparison covers two quantization variants (IQ4_NL and IQ4_XS) of the same Qwen3.6-35B model from two different builders, where the Unsloth builds show lower tokens/sec (about 5.9–6.1 t/s) than the alternatives (about 8.7–8.8 t/s).
  • The user suggests there may be room for optimization and shares a snippet of llama.cpp startup logs (CPU backend, n_parallel auto→4, thread counts, and model loading) as context for troubleshooting.
  • The claim is framed as potentially configuration/model-build dependent, and the user invites others to reproduce/verify the slowdown and look for causes in the GGUF generation or runtime settings.

Hi there, first of all I just want to give a huge thanks to Unsloth for their tireless work producing high-quality GGUFs and for their friendly interaction with us here.

I'm running on a CPU-only setup with the latest llama.cpp on Debian 13. For some reason, on my setup the Unsloth GGUFs get about 30% fewer tokens per second than similarly sized ones from another creator, and follow-up responses take quite a bit longer to process.


  • Qwen3.6-35B-A3B-UD-IQ4_NL (18.0 GB) [Unsloth]
    • Initial response: 6.14 t/s
    • First follow-up response delay: 25 seconds
  • Qwen_Qwen3.6-35B-A3B-IQ4_NL (19.9 GB)
    • Initial response: 8.71 t/s
    • First follow-up response delay: 14 seconds

  • Qwen3.6-35B-A3B-UD-IQ4_XS (17.7 GB) [Unsloth]
    • Initial response: 5.91 t/s
    • First follow-up response delay: 29 seconds
  • Qwen_Qwen3.6-35B-A3B-IQ4_XS (18.8 GB)
    • Initial response: 8.75 t/s
    • First follow-up response delay: 20 seconds
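As a quick sanity check on the "about 30%" figure, the relative slowdown can be computed directly from the initial-response throughputs reported above (using only the numbers in this post):

```python
# Reported initial-response throughput (tokens/sec) from the post:
# (Unsloth build, other creator's build) per quantization variant.
pairs = {
    "IQ4_NL": (6.14, 8.71),
    "IQ4_XS": (5.91, 8.75),
}

for quant, (unsloth_tps, other_tps) in pairs.items():
    slowdown = 1 - unsloth_tps / other_tps
    print(f"{quant}: {slowdown:.1%} fewer tokens/sec")
# → IQ4_NL: 29.5% fewer tokens/sec
# → IQ4_XS: 32.5% fewer tokens/sec
```

So the two variants come out around 29–33% slower, consistent with the ~30% claim.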

So maybe there's some room for optimization. The difference isn't massive, but it's noticeable, probably a bit more so on a CPU-only setup. Here's a bit of the llama.cpp output. Hope this helps!

    llama-server --reasoning off -m ~/Desktop/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
    load_backend: loaded RPC backend from /home/myself/Desktop/llama-b8833/libggml-rpc.so
    load_backend: loaded CPU backend from /home/myself/Desktop/llama-b8833/libggml-cpu-haswell.so
    main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
    build_info: b8833-45cac7ca7
    system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
    Running without SSL
    init: using 11 threads for HTTP server
    start: binding port with default address family
    main: loading model
    srv  load_model: loading model '/home/myself/Desktop/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf'
    common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
    llama_params_fit_impl: no devices with dedicated memory found
    llama_params_fit: successfully fit params to free device memory
    llama_params_fit: fitting params to free memory took 0.57 seconds
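For anyone who wants to reproduce this, llama.cpp ships a `llama-bench` utility that measures prompt-processing and generation throughput in a more controlled way than interactive chat. A minimal sketch, assuming both GGUFs are in the current directory (paths are placeholders, and `-t 6` matches the thread count in the log above):

```shell
# Benchmark both IQ4_NL builds under identical settings:
# -p 512  prompt tokens, -n 128  generated tokens, -r 3  repetitions.
./llama-bench -m Qwen3.6-35B-A3B-UD-IQ4_NL.gguf   -t 6 -p 512 -n 128 -r 3
./llama-bench -m Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf -t 6 -p 512 -n 128 -r 3
```

Running both models with identical flags on the same machine rules out runtime-setting differences, leaving the GGUF builds themselves as the variable.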
submitted by /u/Quagmirable