Qwen 3.6 27B MTP on V100 32GB: 54 t/s

Reddit r/LocalLLaMA / 5/6/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A user reports running Qwen 3.6 27B on a single NVIDIA V100 32GB SXM card (via a PCIe adapter) using am17an’s MTP branch of llama.cpp, with the build and llama-server working smoothly.
  • In their tests with the MTP GGUF, a q8_0-quantized KV cache, and a 200k cache limit (set up as a VS Code Copilot backend), throughput improves from about 29–30 t/s to 54–55 t/s when MTP is enabled, with the card held to a 150W power limit.
  • Once roughly 50k tokens of context have been ingested, speed falls to about 40–45 t/s, but the setup remains effective for tool calls, sub-agents, and code review/refactoring tasks.
  • The post credits am17an and expresses excitement about the MTP branch maturing, indicating promising local inference performance gains for users with V100-class hardware.

Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on a V100 32GB SXM module using one of those PCIe card adapters. Pulled and built in one shot, and llama-server ran without a hitch.

Tested using am17an's MTP GGUF, a q8_0 KV cache, and a 200k cache limit, acting as a VS Code Copilot backend.
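
For context, a launch along these lines might look like the Python sketch below. This is not the poster's exact command: the model path, port, and layer-offload count are placeholders, any MTP-specific switch the branch exposes is not shown, and only standard llama.cpp server flags (-m, -c, -ngl, --cache-type-k/-v, --port) are used.

```python
# Hypothetical sketch of a llama-server launch roughly matching the setup above.
# Paths, port, and offload count are placeholders, not the poster's command.
import subprocess

cmd = [
    "./llama-server",
    "-m", "models/qwen3.6-27b-mtp.gguf",  # placeholder path to the MTP GGUF
    "-c", "200000",                        # ~200k token context / cache limit
    "-ngl", "99",                          # offload all layers to the V100
    "--cache-type-k", "q8_0",              # quantize KV cache keys to q8_0
    "--cache-type-v", "q8_0",              # quantize KV cache values to q8_0
    # note: a quantized V cache typically also requires flash attention
    # to be enabled in llama.cpp builds/versions that don't do so by default
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```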

29-30 t/s without MTP

54-55 t/s with MTP, using a 150W power limit on the card.
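
Not from the post, but as a sanity check, a t/s figure like the ones above can be approximated by timing a request against the running llama-server. The endpoint, port, and payload below are assumptions based on llama-server's OpenAI-compatible API; wall-clock timing folds prompt processing in, so it only approximates the decode speed.

```python
# Rough throughput check against a local llama-server (assumed at port 8080).
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # OpenAI-compatible route

payload = {
    "model": "qwen3.6-27b-mtp",  # name is informational; the server serves whatever model it loaded
    "messages": [{"role": "user", "content": "Write a short code review checklist."}],
    "max_tokens": 512,
    "stream": False,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

generated = resp.json()["usage"]["completion_tokens"]  # OpenAI-style usage block
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")
```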

Falls to 40-45 t/s after choking down 50k tokens, but it's doing great with tool calls and sub-agents, and it made some very insightful code reviews and refactors.

Thank you am17an! Can't wait to see this branch mature, this is great stuff.

submitted by /u/m94301