What speed is everyone getting on Qwen3.6 27b?

Reddit r/LocalLLaMA / 4/23/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A Reddit user reports about 13 tokens per second on Qwen3.6 27B at Q8_0 quantization with a 128k context window, running via llama.cpp.
  • The setup described uses three GPUs (1× RTX 2060 Super 8GB and 2× RTX 5060 Ti 16GB) and specific llama-server launch parameters including K/V cache set to Q8_0.
  • The user shares their configuration details (temperature, top-p/top-k, penalties, and cache settings) and asks whether this throughput is slower than expected.
  • They note that the --fit-target value of 1536 was chosen to leave room for the model’s vision capability to function.
  • Overall, the post is a community performance benchmark/feedback request focused on local LLM inference speed expectations for Qwen3.6 27B.
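To put a tokens-per-second figure like the one reported into perspective, here is a minimal sketch that converts throughput into wall-clock generation time. The 13 t/s value comes from the post; the reply lengths are arbitrary illustrative assumptions, not anything the poster measured.

```python
# Convert a reported decode throughput into rough wall-clock time
# for a reply of a given length (ignores prompt-processing time).
def gen_time_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    return n_tokens / tokens_per_sec

REPORTED_TPS = 13.0  # figure from the post

# Hypothetical reply lengths, chosen only for illustration.
for n in (100, 500, 2000):
    print(f"{n:5d} tokens -> {gen_time_seconds(n, REPORTED_TPS):6.1f} s")
```

At 13 t/s a 500-token reply takes roughly 38 seconds of pure decode time, which is usable for chat but slow for long generations.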

I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0

this is on 3x GPUs (1x 2060 Super 8GB, 2x 5060 Ti 16GB), via llama.cpp

unsure if this is slow or to be expected?

*/llama-server --port 8080 --model */llama.cpp/Qwen3.6-27B-Q8_0/Qwen3.6-27B-Q8_0.gguf -mm */Qwen3.6-27B-Q8_0/mmproj-BF16.gguf -np 1 --temperature 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking": true}' --cache-type-k q8_0 --cache-type-v q8_0 -c 128000 --fit-target 1536

(--fit-target 1536 was to allow some space for the vision capability to work)
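One reason a 128k-context Q8_0 setup on 40GB of mixed VRAM is tight is the KV cache itself. The sketch below estimates its size for a q8_0 K/V cache; the layer/head numbers are hypothetical placeholders, not the actual Qwen3.6 27B config (read the real values from the GGUF metadata). Only the q8_0 storage cost (32 elements in a 34-byte block) and the formula are general.

```python
# Back-of-envelope KV-cache size for a q8_0 K/V cache at long context.
# NOTE: these hyperparameters are HYPOTHETICAL placeholders, not the
# real Qwen3.6 27B config -- substitute values from the GGUF metadata.
N_LAYERS = 48                  # hypothetical transformer layer count
N_KV_HEADS = 8                 # hypothetical GQA key/value head count
HEAD_DIM = 128                 # hypothetical per-head dimension
CTX = 128_000                  # context length from the post
Q8_0_BYTES_PER_ELT = 34 / 32   # q8_0 packs 32 elements into 34 bytes

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elt):
    # Factor of 2 covers the separate K and V tensors per layer.
    return int(2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt)

total = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CTX, Q8_0_BYTES_PER_ELT)
print(f"~{total / 2**30:.1f} GiB of KV cache")
```

With these placeholder numbers the cache alone lands around 12-13 GiB before weights, compute buffers, or the vision projector, which is why leaving headroom (as with --fit-target here) matters.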

submitted by /u/Ambitious_Fold_2874