Hi there, first of all I just want to give a huge thanks for Unsloth's tireless work producing high-quality GGUFs, and also for their friendly interaction with us here.
I'm running on a CPU-only setup with the latest llama.cpp on Debian 13. For some reason, on my setup the Unsloth GGUFs get about 30% fewer tokens/sec than similarly sized quants from another creator, and followup responses take quite a bit longer to process.
- Qwen3.6-35B-A3B-UD-IQ4_NL (18.0 GB) [Unsloth]
  - Initial response: 6.14 t/s
  - First followup response delay: 25 seconds
- Qwen_Qwen3.6-35B-A3B-IQ4_NL (19.9 GB)
  - Initial response: 8.71 t/s
  - First followup response delay: 14 seconds
- Qwen3.6-35B-A3B-UD-IQ4_XS (17.7 GB) [Unsloth]
  - Initial response: 5.91 t/s
  - First followup response delay: 29 seconds
- Qwen_Qwen3.6-35B-A3B-IQ4_XS (18.8 GB)
  - Initial response: 8.75 t/s
  - First followup response delay: 20 seconds
So maybe there's some room for optimization. The difference isn't massive, but it's noticeable, probably more so on a CPU-only setup. Here's a bit of the llama.cpp output. Hope this helps!
```
llama-server --reasoning off -m ~/Desktop/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
load_backend: loaded RPC backend from /home/myself/Desktop/llama-b8833/libggml-rpc.so
load_backend: loaded CPU backend from /home/myself/Desktop/llama-b8833/libggml-cpu-haswell.so
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b8833-45cac7ca7
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 11 threads for HTTP server
start: binding port with default address family
main: loading model
srv  load_model: loading model '/home/myself/Desktop/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.57 seconds
```
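For anyone wanting to reproduce this: my numbers above came from ordinary interactive chat sessions, so as a sketch, something like `llama-bench` (which ships with llama.cpp) should give more repeatable prompt-processing and generation figures for both quants. The paths and thread count below just mirror my setup and are placeholders, not the exact commands I ran:

```shell
# Compare the two IQ4_NL quants under identical settings:
# -t 6 matches the 6 threads llama-server picked on this machine,
# -p 512 measures prompt processing, -n 128 measures token generation.
llama-bench -m ~/Desktop/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf -t 6 -p 512 -n 128
llama-bench -m ~/Desktop/Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf -t 6 -p 512 -n 128
```

That would at least rule out sampling settings or context reuse as the cause of the gap.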