Run Qwen3.5-397B-A13B with vLLM and 8xR9700

Reddit r/LocalLLaMA / 4/12/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical UsageModels & Research

Read original →

共有:

Key Points

The post explains how to run the Qwen3.5-397B-A13B (listed as a 397B MXFP4 variant) on an 8xR9700 setup using vLLM on ROCm via a custom Docker image.
It provides a Dockerfile that installs an updated Transformers version and applies a Triton patch to adjust a topk-related constant for compatibility/performance.
It links to an MXFP4 model checkpoint hosted on Hugging Face and gives step-by-step commands for cloning the model with Git LFS.
It includes a detailed docker run launch command configuring multiple GPU device mappings, HIP/ROCR visibility, shared memory, and vLLM settings such as prefix caching and near-full GPU memory utilization.
The author claims the 397B model runs “super fast,” positioning the guide as an approach to enable large (over-100B) model inference on specific ROCm hardware.

Run Qwen3.5-397B-A13B with vLLM and 8xR9700

Special thanks for u/Sea-Speaker1700 to make possible run mxfp4 on R0700 GPU, first guide to run 122B models here

Well, 397B model works amazing, super fast.

Use this Dockerfile to build image, original image provided by u/Sea-Speaker1700

FROM tcclaviger/vllm-rocm-rdna4-mxfp4:latest # Transformers Update RUN pip install --upgrade transformers # Triton Patch RUN find /app -name "topk.py" -exec grep -l "N_EXPTS_ACT=k," {} \; | xargs -I{} sed -i 's/N_EXPTS_ACT=k, # constants/N_EXPTS_ACT=__import__("triton").next_power_of_2(k), # constants/' {} CMD ["/bin/bash"]

build patched version

docker build -t vllm-mxfp4-patched -f Dockerfile .

Download model:

git lfs clone https://huggingface.co/djdeniro/Qwen3.5-397B-A17B-MXFP4

Launch script, keep your device id, replace $1 with model name, $2 with out port.

docker run --name "$1" \ --rm --tty --ipc=host --shm-size=32g \ --device /dev/kfd:/dev/kfd \ --device /dev/dri/renderD128:/dev/dri/renderD128 \ --device /dev/dri/renderD129:/dev/dri/renderD129 \ --device /dev/dri/renderD130:/dev/dri/renderD130 \ --device /dev/dri/renderD131:/dev/dri/renderD131 \ --device /dev/dri/renderD132:/dev/dri/renderD132 \ --device /dev/dri/renderD137:/dev/dri/renderD137 \ --device /dev/dri/renderD138:/dev/dri/renderD138 \ --device /dev/dri/renderD139:/dev/dri/renderD139 \ --device /dev/mem:/dev/mem \ -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ -v /mnt/llm_disk/models:/app/models:ro \ -e TRUST_REMOTE_CODE=1 \ -e OMP_NUM_THREADS=8 \ -e PYTORCH_TUNABLEOP_ENABLED=1 \ -e PYTORCH_TUNABLEOP_TUNING=0 \ -e PYTORCH_TUNABLEOP_RECORD_UNTUNED=0 \ -e VLLM_ROCM_USE_AITER=0 \ -e PYTORCH_TUNABLEOP_FILENAME=/tunableop/tunableop_merged.csv \ -e PYTORCH_TUNABLEOP_UNTUNED_FILENAME=/tunableop/tunableop_untuned%%d.csv \ -e GPU_MAX_HW_QUEUES=1 \ -p "$2":8000 \ -e TRITON_CACHE_DIR=/root/.triton/cache \ vllm-mxfp4-patched \ /app/models/Qwen3.5-397B-A17B-MXFP4 \ --served-model-name "$1" --host 0.0.0.0 --port 8000 --trust-remote-code \ --enable-prefix-caching --gpu-memory-utilization 0.98 --tensor-parallel-size 8 \ --max-model-len 131072 --max-num-seqs 4 \ --tool-call-parser qwen3_coder --enable-auto-tool-choice \ --override-generation-config '{"max_tokens": 64000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' \ --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128], "max_cudagraph_capture_size": 128}' \ --max-num-batched-tokens 2048 \ --limit-mm-per-prompt.image 2 --mm-processor-cache-gb 1 \ --mm-processor-kwargs '{"max_pixels": 602112}' \ --reasoning-parser qwen3

Loading model 400-600s first time, and then got 30 t/s on tg, 3.5-3.7k on pp in one request.

in 4x requests you will got up to 100 t/s.

I limit power per gpu (210W), if power limit 300W per gpu will speedup model.

Best result with this model i have when thinking budget is 0 tokens for coding tasks.

submitted by /u/djdeniro
[link] [comments]

Black Hat USA

AI Business

Black Hat Asia

AI Business

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

Best AI Video Generator in 2026: Top Tools Tested & Compared

Dev.to

The Future of Agent Integration: A2A vs ANP and the Three-Layer Security Architecture

Dev.to

Run Qwen3.5-397B-A13B with vLLM and 8xR9700

Key Points

Related Articles

Black Hat USA

Black Hat Asia

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Best AI Video Generator in 2026: Top Tools Tested & Compared

The Future of Agent Integration: A2A vs ANP and the Three-Layer Security Architecture

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer