Qwen 3.6 27B MTP on V100 32GB: 54 t/s

Reddit r/LocalLLaMA / 5/6/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A user reports running Qwen 3.6 27B on a single NVIDIA V100 32GB SXM card (via a PCIe adapter) using am17an’s MTP branch of llama.cpp, with the build and llama-server working smoothly.
  • In their tests with the MTP GGUF, a q8_0-quantized KV cache, and a 200k cache limit (set up as a VS Code Copilot backend), throughput improves from about 29–30 t/s to 54–55 t/s when MTP is enabled, with the card held to a 150W power limit.
  • Once roughly 50k tokens of context have been ingested, speed falls to about 40–45 t/s, but the setup remains effective for tool calls, sub-agents, and code review/refactoring tasks.
  • The post credits am17an and expresses excitement about the MTP branch maturing, indicating promising local inference performance gains for users with V100-class hardware.

Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on a V100 32GB SXM module using one of those PCIe card adapters. Pulled and built in one shot, and llama-server ran without a hitch.

Tested using am17an's MTP GGUF, a q8_0 KV cache, and a 200k cache limit, acting as a VS Code Copilot backend.
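
For context, a launch along these lines might look like the Python sketch below. This is not the poster's exact command: the model path, port, and layer-offload count are placeholders, any MTP-specific switch the branch exposes is not shown, and only standard llama.cpp server flags (-m, -c, -ngl, --cache-type-k/-v, --port) are used.

```python
# Hypothetical sketch of a llama-server launch roughly matching the setup above.
# Paths, port, and offload count are placeholders, not the poster's command.
import subprocess

cmd = [
    "./llama-server",
    "-m", "models/qwen3.6-27b-mtp.gguf",  # placeholder path to the MTP GGUF
    "-c", "200000",                        # ~200k token context / cache limit
    "-ngl", "99",                          # offload all layers to the V100
    "--cache-type-k", "q8_0",              # quantize KV cache keys to q8_0
    "--cache-type-v", "q8_0",              # quantize KV cache values to q8_0
    # note: a quantized V cache typically also requires flash attention
    # to be enabled in llama.cpp builds/versions that don't do so by default
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```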

29-30 t/s without MTP

54-55 t/s with MTP, using a 150W power limit on the card.
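
Not from the post, but as a sanity check, a t/s figure like the ones above can be approximated by timing a request against the running llama-server. The endpoint, port, and payload below are assumptions based on llama-server's OpenAI-compatible API; wall-clock timing folds prompt processing in, so it only approximates the decode speed.

```python
# Rough throughput check against a local llama-server (assumed at port 8080).
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # OpenAI-compatible route

payload = {
    "model": "qwen3.6-27b-mtp",  # name is informational; the server serves whatever model it loaded
    "messages": [{"role": "user", "content": "Write a short code review checklist."}],
    "max_tokens": 512,
    "stream": False,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

generated = resp.json()["usage"]["completion_tokens"]  # OpenAI-style usage block
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")
```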

Falls to 40-45 t/s after choking down 50k tokens, but it's doing great with tool calls and sub-agents, and it made some very insightful code reviews and refactors.

Thank you am17an! Can't wait to see this branch mature, this is great stuff.

submitted by /u/m94301