Qwen3.5 397B A17B GPTQ 4-bit @ 32 tok/s (output) and 2,000 tok/s (input of 20k tok) on vllm-gfx906-mobydick.

GitHub link of the vLLM fork: https://github.com/ai-infos/vllm-gfx906-mobydick

Power draw: 550 W (idle) / 2,400 W (peak inference)

Goal: run Qwen3.5 397B A17B GPTQ 4-bit on the most cost-effective hardware, like 16× MI50, at decent speed (token generation & prompt processing).

Coming next: open-source a future test setup of 32× AMD MI50 32GB for Kimi K2.5 Thinking and/or GLM-5.

Credits: BIG thanks to the global open-source community!

All setup details here: https://github.com/ai-infos/guidances-setup-16-mi50-qwen35-397b

Feel free to ask any questions and/or share any comments.

ps: Mixing CPU/GPU hardware might be a good alternative as RAM/VRAM prices increase, and token-generation/prompt-processing speed will be much better with 16 TB/s aggregate bandwidth + tensor parallelism + MTP (multi-token prediction)!

ps2: A few months ago I made a similar post for DeepSeek v3.2. The initial goal of vllm-gfx906-mobydick was actually to run big models like DeepSeek, but back then the fork wasn't stable enough using FP16 activations. Now it is quite stable for both DeepSeek v3.2 and Qwen3.5 397B at large context using FP32 activations (with some FP16 attention computations for performance).

ps3: With the vllm-gfx906-mobydick fork you can also run smaller recent models (the base is vLLM v0.17.1), like Qwen3.5 27B (reaching 56 tok/s at MTP5 and TP4; it also fits on a single MI50 32GB with 65k context). Maybe later, if you are interested, I can make other posts showing benchmarks with smaller setups.

ps4: The idea of using FP32 activations (with a mix of FP16 attention computations) instead of full BF16 on older GPUs that don't support BF16 can obviously be extended to GPUs other than the AMD MI50. So I guess this vllm-gfx906-mobydick fork can be reused for other older GPUs (with or without some adaptations).

ps5: The image above (rocm-smi) shows the temps/power while vLLM is idle (after some generation; peak is around 71°C / 120 W per GPU).
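For context on what such a tensor-parallel launch looks like, here is a minimal sketch of a serve command. This assumes the fork keeps upstream vLLM's CLI; the model path and all flag values are illustrative placeholders, not taken from the linked setup guide.

```shell
# Hypothetical launch on 16x MI50 (gfx906). All flags are standard
# upstream-vLLM options; the model path and values are illustrative.
# --dtype float32 reflects the FP32-activation approach described in
# the post, since gfx906 lacks BF16 support.
vllm serve /models/Qwen3.5-397B-A17B-GPTQ-4bit \
  --tensor-parallel-size 16 \
  --dtype float32 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```

The real setup (MTP/speculative-decoding options, attention backend, context length) is documented in the guidances-setup repository linked above.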
16x AMD MI50 32GB at 32 t/s (tg) & 2k t/s (pp) with Qwen3.5 397B (vllm-gfx906-mobydick)
Reddit r/LocalLLaMA / 4/1/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- The post reports a large local inference setup using 16× AMD Radeon Instinct MI50 32GB GPUs running Qwen3.5 397B (A17B, GPTQ 4-bit) on the vllm-gfx906-mobydick vLLM fork.
- It claims throughput of about 32 tokens/s for output (TG) and around 2,000 tokens/s for prompt processing/input (PP) with a 20k-token input context.
- The setup targets cost-effective deployment, citing power draw of ~550W idle and ~2,400W peak during inference, and emphasizes the benefits of high bandwidth plus tensor parallelism and multi-token prediction.
- The author says the vllm-gfx906-mobydick fork has improved stability for running both DeepSeek v3.2 and Qwen3.5 397B with FP32 activations (while using some FP16 attention computations for performance), enabling “big context” runs.
- They provide GitHub links for the fork and detailed build/benchmark guidance, and suggest future scaling to 32× MI50 32GB for other models like Kimi K2.5 Thinking and GLM-5.
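The reported figures allow a quick back-of-envelope efficiency check: at 2,000 tok/s prefill, the 20k-token prompt takes about 10 seconds to process, and 2,400 W at 32 tok/s works out to 75 J per generated token. A small sketch of that arithmetic (all numbers taken from the post):

```python
# Back-of-envelope efficiency figures from the numbers reported in the post.
prompt_tokens = 20_000   # input context length (20k tok)
prefill_rate = 2_000     # tok/s prompt processing (pp)
decode_rate = 32         # tok/s token generation (tg)
peak_power_w = 2_400     # watts at peak inference

prefill_seconds = prompt_tokens / prefill_rate        # time to ingest the prompt
joules_per_output_token = peak_power_w / decode_rate  # energy per generated token

print(f"prefill: {prefill_seconds:.1f} s")                     # 10.0 s
print(f"energy: {joules_per_output_token:.1f} J/output token")  # 75.0 J
```

Note that 75 J/token is an upper bound, since the 2,400 W figure is the whole-rig peak rather than a sustained average during decode.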




