Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers

Reddit r/LocalLLaMA / 3/19/2026

📰 NewsDeveloper Stack & InfrastructureTools & Practical Usage

共有:

Key Points

The post documents a working setup for running Qwen3.5-122B-A10B-GPTQ-Int4 on 4× AMD Radeon AI PRO R9700 GPUs using vLLM ROCm, with a Gigabyte MC62-G40 motherboard and Threadripper Pro 5955WX.
Real-world measurements on a 41k-context workflow show prefill at 34.9 s, total time of 101.7 s, and vLLM throughput about 4150 tokens/s with decode around 41 tokens/s, indicating a significant speed advantage for prefill over llama.cpp on the same hardware.
The user notes standard HF weights OOM’d at target settings, so GPTQ Int4 was used, and they mitigated model thinking with chat_template_kwargs and a small proxy; quality remains a trade-off versus llama.cpp Q5_K_XL.
The post includes a detailed docker run command and mentions optimizations/omissions (e.g., unnecessary flags) and acknowledges a downside: all 4 GPUs are engaged, implying scalability limits.

First, this not possible without u/djdeniro (https://www.reddit.com/r/LocalLLaMA/comments/1rlgovg/qwen35122ba10bgptqint4_on_4xr9700_recipe/); u/sloptimizer (https://www.reddit.com/r/LocalLLaMA/comments/1rlgovg/qwen35122ba10bgptqint4_on_4xr9700_recipe/o8wxdly/) and u/Ok-Ad-8976 (https://www.reddit.com/r/LocalLLaMA/comments/1rhk0gz/r9700_and_vllm_with_qwen35/), where i learnt the recipes to start this.

Hardware: 4× AMD Radeon AI PRO R9700 (32 GB each) with vLLM on a Gigabyte MC62-G40 + Threadripper Pro 5955WX, 6/8 dimm slots filled with 16gb ddr4 2133 rdimms - yes i bought off ebay and 2 were throwing ECs during burn-in.

Big surprise: for my real 41k-context workflow, prefill was dramatically faster than llama.cpp.

Measured result on one real task: - TTFT / prefill: 34.9 s - Total time: 101.7 s - vLLM reported about 4150 tok/s prompt throughput - basically blazing fast. - decode 41 tok/s

Compared with my earlier llama.cpp setup on the same box, this was a huge prefill win (70 t/s PP and 20 t/s TG - yuck).

notes: - used Qwen3.5-122B-A10B-GPTQ-Int4 - standard HF weights OOM’d at my target settings, so GPTQ Int4 was the path that fit - to stop Qwen from “thinking” all over the place, I had to send: chat_template_kwargs: {"enable_thinking": false} - OpenWebUI did not expose that cleanly for me, so I put a tiny proxy in front of vLLM to inject it - quality on my real workflow was still a bit worse than llama.cpp Q5_K_XL, so this is not a blanket “vLLM is better” claim — more like massive speed win, some quality trade-off

Working launch command: docker run --rm --tty \ --name vllm-qwen35-gptq \ --ipc=host \ --shm-size=128g \ --device /dev/kfd:/dev/kfd \ --device /dev/dri:/dev/dri \ --device /dev/mem:/dev/mem \ -e VLLM_ROCM_USE_AITER=1 \ -e HSA_OVERRIDE_GFX_VERSION=12.0.1 \ -e VLLM_ROCM_USE_AITER_MOE=1 \ -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \ -e HSA_ENABLE_SDMA=0 \ -v "$PWD/hf-cache:/root/.cache/huggingface" \ -p 8000:8000 \ rocm/vllm-dev:upstream_preview_releases_v0.17.0_20260303 \ vllm serve Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \ --served-model-name Qwen3.5-122B \ --host 0.0.0.0 \ --port 8000 \ --max-model-len 56000 \ --tensor-parallel-size 4 \ --disable-log-requests \ --max-num-seqs 1 \ --gpu-memory-utilization 0.95 \ --dtype float16

Things I found unnecessary / ignored on this image: - VLLM_V1_USE_PREFILL_DECODE_ATTENTION - VLLM_USE_TRITON_FLASH_ATTN - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Downsides (I am still not happy): - all 4 GPUs were fully engaged and got hot 90+c in an airconditioned room - i had a script running to kick my fans in full speed when GPU temps >90c. - high idle power (~90 W/GPU) on this setup, so this is still in burn-in / tuning stage - there was also a warning that vLLM was using a default MoE config for my GPU, so there may still be performance left on the table as support matures

Hope this helps someone out there. Godspeed.

submitted by /u/grunt_monkey_
[link] [comments]

ベテランの若手育成負担を減らせ、PLC制御の「ラダー図」をAIで生成

日経XTECH

Run Claude Opus 4.6 via OpenAI-compatible API using your existing Pro/Max subscription

Dev.to

Jupyter AI Extension - Multi-LLM Support

Dev.to

How to Build an AI Team: The Solopreneur Playbook

Dev.to

Getting Started with AI Agents

Dev.to

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers

Key Points

Related Articles

ベテランの若手育成負担を減らせ、PLC制御の「ラダー図」をAIで生成

Run Claude Opus 4.6 via OpenAI-compatible API using your existing Pro/Max subscription

Jupyter AI Extension - Multi-LLM Support

How to Build an AI Team: The Solopreneur Playbook

Getting Started with AI Agents

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer