Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on a V100 32GB SXM module mounted in one of those PCIe adapter cards. It pulled and built in one shot, and llama-server ran without a hitch.
Tested with am17an's MTP GGUF, a q8_0 KV cache, and a 200k context limit, running as a VS Code copilot.
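For anyone who wants to try the same setup, here's roughly the shape of the build and launch. The branch URL, branch name, and model filename below are placeholders, not the real ones, and I'm not showing any MTP-specific flags the branch may add; the KV-cache and context flags are stock llama.cpp.

```bash
# Standard CUDA build of llama.cpp (repo URL and branch name are placeholders)
git clone --branch mtp https://github.com/am17an/llama.cpp llama.cpp-mtp
cd llama.cpp-mtp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve with a q8_0 KV cache and a 200k context window
# (model filename is a placeholder for am17an's MTP GGUF)
./build/bin/llama-server \
  -m models/mtp-model-q8_0.gguf \
  -ngl 99 \
  -c 200000 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080
```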
29-30 t/s without MTP
54-55 t/s with MTP, using a 150W power limit on the card (set as shown below).
Throughput falls to 40-45 t/s once the context passes 50k tokens, but it handles tool calls and sub-agents well and produced some very insightful code reviews and refactors.
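For reference, the 150W cap is just the normal driver power limit, set with nvidia-smi before starting the server:

```bash
# Cap GPU 0 at 150W (does not persist across reboots)
sudo nvidia-smi -i 0 -pl 150
```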
Thank you am17an! Can't wait to see this branch mature; this is great stuff.