Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower?

Reddit r/LocalLLaMA / 4/18/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A Reddit user reports that Unsloth-produced GGUFs for Qwen3.6-35B run noticeably slower on a CPU-only Debian 13 setup with llama.cpp, yielding about 30% fewer tokens per second than similar models from other creators.
  • The user also observes longer latency for follow-up responses with the Unsloth GGUFs (e.g., ~25–29 seconds vs ~14–20 seconds in the reported comparisons), suggesting the slowdown is not limited to the first response.
  • The comparison includes two quantization variants (IQ4_NL) and two model sizes within the Qwen3.6-35B family, where Unsloth builds show lower throughput (about 5.9–6.1 t/s) than the alternatives (about 8.7–8.8 t/s).
  • The user suggests there may be room for optimization and shares a snippet of llama.cpp startup logs (CPU backend, n_parallel auto→4, thread counts, and model loading) as context for troubleshooting.
  • The claim is framed as potentially configuration/model-build dependent, and the user invites others to reproduce/verify the slowdown and look for causes in the GGUF generation or runtime settings.
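
As a quick sanity check on the "about 30%" figure, the throughput ranges quoted above can be turned into a relative slowdown. This is only a sketch based on the numbers in this summary; the helper name `percent_slower` is hypothetical and not taken from the original post.

```python
def percent_slower(slow_tps: float, fast_tps: float) -> float:
    """Percentage by which slow_tps falls short of fast_tps."""
    return (fast_tps - slow_tps) / fast_tps * 100.0


# Throughput ranges quoted in the summary, in tokens per second.
unsloth = (5.9, 6.1)
others = (8.7, 8.8)

# Smallest gap: fastest Unsloth figure vs slowest alternative figure.
smallest_gap = percent_slower(unsloth[1], others[0])
# Largest gap: slowest Unsloth figure vs fastest alternative figure.
largest_gap = percent_slower(unsloth[0], others[1])

print(f"{smallest_gap:.1f}% to {largest_gap:.1f}% fewer tokens per second")
```

The result lands at roughly 30% to 33%, which is consistent with the headline claim. Anyone reproducing the comparison can substitute their own measured t/s values.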
