TheTom's GPU-accelerated turboquant (turbo3) has unlocked high-context gains for the 35B-A3B family.
I can now achieve ~40 tg/s with the following GPU-POOR compilation flags and server configuration:
```
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FORCE_MMQ=ON
```

```
./local/bin/llama-cpp-turboquant/llama-server \
  --alias 'Qwen3-6-35B-A3B-turbo' \
  --ctx-size 0 \
  --fit on \
  --no-mmproj \
  --jinja \
  --flash-attn on \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --reasoning off \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
```

These are the Qwen3.6 recommended sampling settings for thinking off; I find time-to-first-acceptable-solution is better with a prompt harness staged as ask, validate, review, refine/accept.
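For anyone curious what that staged harness looks like, here is a minimal sketch. The endpoint URL assumes llama-server's default port and its OpenAI-compatible chat API; the `validate` check, prompt wording, and round limit are my own placeholders, not anything from the Qwen3.6 docs — swap in whatever acceptance criteria you actually use.

```python
import json
import urllib.request

def llama_complete(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    """One-shot call to llama-server's OpenAI-compatible chat endpoint
    (default port assumed; adjust if you pass --port)."""
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode()
    req = urllib.request.Request(url, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def harness(task, complete, validate, max_rounds=3):
    """ask -> validate -> review -> refine/accept loop.

    `complete` is the model call (injected so this is testable without a
    running server); `validate` returns (ok, feedback). Returns the first
    answer that passes validation, or the last refinement after max_rounds.
    """
    answer = complete(task)                                # ask
    for _ in range(max_rounds):
        ok, feedback = validate(answer)                    # validate
        if ok:
            return answer                                  # accept
        critique = complete(                               # review
            f"Review this answer. Reported issue: {feedback}\n\n{answer}")
        answer = complete(                                 # refine
            f"Task: {task}\nCritique: {critique}\nRefine the answer.")
    return answer
```

In practice you'd call `harness("your task", llama_complete, your_validator)`; keeping `complete` injectable also lets you A/B the same harness against thinking-on and thinking-off endpoints.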