Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post describes a large local inference setup using 32× AMD MI50 32GB GPUs running the Kimi K2.6 model with int4, reporting about 9.7 tok/s (TG) and 263 tok/s (PP).
  • It claims the benchmarks were run on a vLLM fork (“vllm-gfx906-mobydick”) and provides a GitHub link to the fork.
  • The author reports power consumption of roughly ~640W idle and ~4800W at peak inference, and states it is generally not worth it unless you have solar panels or free electricity.
  • The configuration uses two nodes of 16 GPUs connected via 10G Ethernet, and includes specific environment variables and torchrun distributed commands used to start the OpenAI-compatible server.
  • The author notes he has not yet published a full "guidance setup" for this configuration because performance is still limited (e.g., GPUs running at reduced PCIe bandwidth due to unstable risers).
Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6

32 MI50 32GB setup

moonshotai/Kimi-K2.6 int4 @ 9.7 tok/s generation (136-token output) and 263 tok/s prompt processing (14,564-token input) on vllm-gfx906-mobydick

GitHub link to the vLLM fork: https://github.com/ai-infos/vllm-gfx906-mobydick

Power draw: ~640W (idle) / ~4800W (peak inference)

Is it worth it? No, unless you’ve got solar panels or free energy…

Setup details:
It’s just 2 nodes of 16 GPUs that I plugged together with a 10G Ethernet cable. You can find details on a single 16-GPU node here:

https://github.com/ai-infos/guidances-setup-16-mi50-deepseek-v32

Commands I run (one per node):

Node 0 (master, 10.0.0.8):

NCCL_SOCKET_IFNAME=eno1 GLOO_SOCKET_IFNAME=eno1 PYTHONUNBUFFERED=1 VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=1200 OMP_NUM_THREADS=4 \
FLASH_ATTENTION_TRITON_AMD_REF="TRUE" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG \
python3 -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node=16 --master_addr=10.0.0.8 --master_port=29500 \
  /llm/models/shared/openai_server_kimi.py 2>&1 | tee log.txt

Node 1 (same command, only the rank changes):

NCCL_SOCKET_IFNAME=eno1 GLOO_SOCKET_IFNAME=eno1 PYTHONUNBUFFERED=1 VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=1200 OMP_NUM_THREADS=4 \
FLASH_ATTENTION_TRITON_AMD_REF="TRUE" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG \
python3 -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node=16 --master_addr=10.0.0.8 --master_port=29500 \
  /llm/models/shared/openai_server_kimi.py 2>&1 | tee log.txt

The script "openai_server_kimi.py" is just based on the official vLLM torchrun example (modified to support the OpenAI API, and not really optimized; the default vLLM serve command with torchrun didn’t work for me and needs more debugging). I can share it on GitHub too if there’s any interest, but it needs more optimization first.
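For reference, the upstream torchrun example it wraps looks roughly like the minimal sketch below. The model id and TP/PP split are placeholders for a 32-GPU layout, not my exact settings, and the OpenAI API layer is not shown:

# Minimal sketch of vLLM's torchrun ("external_launcher") example.
# Model id and parallel sizes are assumptions, not the exact production config.
from vllm import LLM, SamplingParams

prompts = ["What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# With distributed_executor_backend="external_launcher", every torchrun rank
# runs this same script and must call generate() with identical inputs (SPMD).
llm = LLM(
    model="moonshotai/Kimi-K2.6",      # assumed checkpoint id/path
    tensor_parallel_size=8,            # assumed: TP8 x PP4 across 2x16 GPUs
    pipeline_parallel_size=4,
    distributed_executor_backend="external_launcher",
)

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)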

PS: I still didn’t do a full guidance setup for this because I’m not quite satisfied with the perf. First, this setup runs at PCIe gen3 x8 and gen4 x4; all links are supposed to reach ~7 GB/s, but one only gets 3.5 GB/s (due to unstable risers). Theoretically, if I manage a new setup with max PCIe bandwidth (28 GB/s at x16, or 14 GB/s at x8), in TP8 PP4 (or TP4 PP8) and with an optimized vLLM software stack, I believe we can jump to 600-1000 tok/s PP and 9-12 tok/s TG (without MTP). At that point this setup might be interesting compared to a hybrid setup (DDR5 + RTX 6000 Pro, etc.), but I think I’m done with all of it and might just enjoy small models, which are much faster on smaller setups.
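As a rough sanity check on those bandwidth figures (theoretical per-direction rates; my ~7/14/28 GB/s numbers are the slightly lower effective values you see in practice):

# Rough theoretical PCIe bandwidth per direction (GB/s): raw transfer rate
# times 128b/130b encoding overhead (used by both gen3 and gen4).
def pcie_gb_s(gen: int, lanes: int) -> float:
    gt_per_lane = {3: 8.0, 4: 16.0}[gen]          # GT/s per lane
    return gt_per_lane * lanes * (128 / 130) / 8  # bytes/s after encoding

for gen, lanes in [(3, 8), (4, 4), (4, 8), (4, 16)]:
    print(f"gen{gen} x{lanes}: {pcie_gb_s(gen, lanes):.1f} GB/s")
# gen3 x8:  7.9 GB/s    gen4 x4:  7.9 GB/s
# gen4 x8: 15.8 GB/s    gen4 x16: 31.5 GB/s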

Feel free to ask any questions and/or share any comments.

submitted by /u/ai-infos