Run Qwen3.5-397B-A13B with vLLM and 8xR9700

Reddit r/LocalLLaMA / 4/12/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical UsageModels & Research

Key Points

  • The post explains how to run the Qwen3.5-397B-A13B (listed as a 397B MXFP4 variant) on an 8xR9700 setup using vLLM on ROCm via a custom Docker image.
  • It provides a Dockerfile that installs an updated Transformers version and applies a Triton patch to adjust a topk-related constant for compatibility/performance.
  • It links to an MXFP4 model checkpoint hosted on Hugging Face and gives step-by-step commands for cloning the model with Git LFS.
  • It includes a detailed docker run launch command configuring multiple GPU device mappings, HIP/ROCR visibility, shared memory, and vLLM settings such as prefix caching and near-full GPU memory utilization.
  • The author claims the 397B model runs “super fast,” positioning the guide as an approach to enable large (over-100B) model inference on specific ROCm hardware.
Run Qwen3.5-397B-A13B with vLLM and 8xR9700

Special thanks for u/Sea-Speaker1700 to make possible run mxfp4 on R0700 GPU, first guide to run 122B models here

Well, 397B model works amazing, super fast.

Use this Dockerfile to build image, original image provided by u/Sea-Speaker1700

FROM tcclaviger/vllm-rocm-rdna4-mxfp4:latest # Transformers Update RUN pip install --upgrade transformers # Triton Patch RUN find /app -name "topk.py" -exec grep -l "N_EXPTS_ACT=k," {} \; | xargs -I{} sed -i 's/N_EXPTS_ACT=k, # constants/N_EXPTS_ACT=__import__("triton").next_power_of_2(k), # constants/' {} CMD ["/bin/bash"] 

build patched version

docker build -t vllm-mxfp4-patched -f Dockerfile .

Download model:

git lfs clone https://huggingface.co/djdeniro/Qwen3.5-397B-A17B-MXFP4

Launch script, keep your device id, replace $1 with model name, $2 with out port.

docker run --name "$1" \ --rm --tty --ipc=host --shm-size=32g \ --device /dev/kfd:/dev/kfd \ --device /dev/dri/renderD128:/dev/dri/renderD128 \ --device /dev/dri/renderD129:/dev/dri/renderD129 \ --device /dev/dri/renderD130:/dev/dri/renderD130 \ --device /dev/dri/renderD131:/dev/dri/renderD131 \ --device /dev/dri/renderD132:/dev/dri/renderD132 \ --device /dev/dri/renderD137:/dev/dri/renderD137 \ --device /dev/dri/renderD138:/dev/dri/renderD138 \ --device /dev/dri/renderD139:/dev/dri/renderD139 \ --device /dev/mem:/dev/mem \ -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ -v /mnt/llm_disk/models:/app/models:ro \ -e TRUST_REMOTE_CODE=1 \ -e OMP_NUM_THREADS=8 \ -e PYTORCH_TUNABLEOP_ENABLED=1 \ -e PYTORCH_TUNABLEOP_TUNING=0 \ -e PYTORCH_TUNABLEOP_RECORD_UNTUNED=0 \ -e VLLM_ROCM_USE_AITER=0 \ -e PYTORCH_TUNABLEOP_FILENAME=/tunableop/tunableop_merged.csv \ -e PYTORCH_TUNABLEOP_UNTUNED_FILENAME=/tunableop/tunableop_untuned%%d.csv \ -e GPU_MAX_HW_QUEUES=1 \ -p "$2":8000 \ -e TRITON_CACHE_DIR=/root/.triton/cache \ vllm-mxfp4-patched \ /app/models/Qwen3.5-397B-A17B-MXFP4 \ --served-model-name "$1" --host 0.0.0.0 --port 8000 --trust-remote-code \ --enable-prefix-caching --gpu-memory-utilization 0.98 --tensor-parallel-size 8 \ --max-model-len 131072 --max-num-seqs 4 \ --tool-call-parser qwen3_coder --enable-auto-tool-choice \ --override-generation-config '{"max_tokens": 64000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' \ --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128], "max_cudagraph_capture_size": 128}' \ --max-num-batched-tokens 2048 \ --limit-mm-per-prompt.image 2 --mm-processor-cache-gb 1 \ --mm-processor-kwargs '{"max_pixels": 602112}' \ --reasoning-parser qwen3 

Loading model 400-600s first time, and then got 30 t/s on tg, 3.5-3.7k on pp in one request.

in 4x requests you will got up to 100 t/s.

I limit power per gpu (210W), if power limit 300W per gpu will speedup model.

Best result with this model i have when thinking budget is 0 tokens for coding tasks.

submitted by /u/djdeniro
[link] [comments]