moonshotai/Kimi-K2.6 int4 @ 9.7 tok/s (output of 136 tok) and 263 tok/s (input of 14564 tok) on vllm-gfx906-mobydick.

GitHub link of the vLLM fork: https://github.com/ai-infos/vllm-gfx906-mobydick

Power draw: ~640W (idle) / ~4800W (peak inference).

Is it worth it? No, unless you've got solar panels or free energy…

Setup details: https://github.com/ai-infos/guidances-setup-16-mi50-deepseek-v32

Command I run: the script "openai_server_kimi.py" is based on the official vLLM torchrun example (modified to support the OpenAI API, and not really optimized; the default vLLM command that includes torchrun didn't work for me and needs more debugging). I can share it on GitHub too if there's any interest, but it needs more optimization first.

PS: I still haven't done a full guidance setup for this because I'm not satisfied with the performance. This setup runs at PCIe Gen3 x8 and Gen4 x4; all links are supposed to reach ~7 GB/s, but one only gets 3.5 GB/s (due to riser instability). Theoretically, if I manage a new setup at full PCIe bandwidth (28 GB/s at x16, or 14 GB/s at x8) in TP8 PP4 (or TP4 PP8), with an optimized vLLM software stack, I believe we could jump to 600-1000 PP and 9-12 TG (without MTP). At that point this setup might be interesting compared to hybrid setups (DDR5 + RTX 6000 Pro, etc.), but I think I'm done with all of it and might just enjoy small models, which are much faster on smaller setups.

Feel free to ask any questions and/or share any comments.
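As a sanity check on the "not worth it" verdict, the reported peak draw and generation rate imply a per-token energy cost. A minimal sketch, assuming the post's ~4800 W peak and 9.7 tok/s figures; the electricity price is a hypothetical placeholder, not from the post:

```python
# Rough energy-cost estimate from the reported numbers.
PEAK_WATTS = 4800        # peak inference draw from the post
TOKENS_PER_SEC = 9.7     # output (TG) rate from the post
PRICE_PER_KWH = 0.20     # hypothetical electricity price in USD (assumption)

joules_per_token = PEAK_WATTS / TOKENS_PER_SEC   # watts are J/s
wh_per_token = joules_per_token / 3600           # 1 Wh = 3600 J
usd_per_million_tokens = wh_per_token / 1000 * PRICE_PER_KWH * 1_000_000

print(f"~{joules_per_token:.0f} J/token, "
      f"~${usd_per_million_tokens:.2f} per 1M output tokens")
```

At roughly 495 J per output token, electricity alone runs on the order of tens of dollars per million generated tokens at typical grid prices, which is consistent with the author's conclusion.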
Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6
Reddit r/LocalLLaMA / 5/1/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The post describes a large local inference setup using 32× AMD MI50 32GB GPUs running the Kimi K2.6 model with int4, reporting about 9.7 tok/s (TG) and 263 tok/s (PP).
- It claims the benchmarks were run on a vLLM fork (“vllm-gfx906-mobydick”) and provides a GitHub link to the fork.
- The author reports power consumption of roughly ~640W idle and ~4800W at peak inference, and states it is generally not worth it unless you have solar panels or free electricity.
- The configuration uses two nodes of 16 GPUs connected via 10G Ethernet, and includes specific environment variables and torchrun distributed commands used to start the OpenAI-compatible server.
- The author notes the setup has not been fully optimized (no full "guidance setup" yet) and that performance is limited by reduced PCIe bandwidth (Gen3 x8 and Gen4 x4 links, with one running at 3.5 GB/s instead of ~7 GB/s due to riser instability).
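For context on the bandwidth figures, the theoretical per-direction PCIe rates can be reproduced from the link parameters. A rough sketch that accounts only for 128b/130b line encoding and ignores other protocol overhead; the post's 28 GB/s figure appears to be a conservative rounding of the ~31.5 GB/s theoretical Gen4 x16 rate:

```python
# Theoretical PCIe bandwidth per direction (GB/s), 128b/130b encoding only.
def pcie_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """GT/s per lane * lane count * encoding efficiency / 8 bits per byte."""
    return gt_per_s * lanes * (128 / 130) / 8

gen3_x8 = pcie_gb_per_s(8, 8)     # ~7.9 GB/s, matches the post's "~7 GB/s"
gen4_x4 = pcie_gb_per_s(16, 4)    # ~7.9 GB/s, same effective rate
gen4_x16 = pcie_gb_per_s(16, 16)  # ~31.5 GB/s theoretical ceiling

print(f"Gen3 x8: {gen3_x8:.1f}, Gen4 x4: {gen4_x4:.1f}, Gen4 x16: {gen4_x16:.1f} GB/s")
```

This shows why the Gen3 x8 and Gen4 x4 links are nominally equivalent, and why one link stuck at 3.5 GB/s halves the effective bandwidth of that path.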