CUDA + ROCm simultaneously with -DGGML_BACKEND_DL=ON!

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A user reports successfully running LLM models (e.g., Minimax 2.7 Q4) using both CUDA and ROCm simultaneously by setting -DGGML_BACKEND_DL=ON and bypassing Vulkan.
  • They note that the main performance advantage comes from the “prefill” stage, with detailed GPU layer offloading and per-device model buffer sizes shown in logs (CUDA0 and ROCm0).
  • The post includes a Windows build setup using CMake/Ninja, specifying ROCm toolchains via clang-cl, enabling HIP and CUDA, and configuring CUDA architecture and build type.
  • They mention that enabling -DGGML_CPU_ALL_VARIANTS=ON caused many compilation errors, requiring a manual edit to the ggml CMakeLists.txt (disabling the Alder Lake CPU variant), which is harmless on their Ryzen 5950X.
  • Runtime instructions are provided, including setting the ROCm PATH and running llama-server with extensive inference-related flags (context size, threads, flash-attn, cache types, and parallelism).
CUDA + ROCm simultaneously with -DGGML_BACKEND_DL=ON!

I invested quite a bit of time and it wasn't easy, but I can finally run models like Minimax 2.7 Q4 using CUDA + ROCm at the same time, bypassing Vulkan.
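For context: -DGGML_BACKEND_DL=ON builds each ggml backend as its own shared library that gets loaded at runtime, which is what lets the CUDA and HIP backends coexist in a single llama.cpp install instead of being a compile-time either/or. Recent llama.cpp builds can print which devices were picked up (flag name taken from current upstream; output format may vary):

llama-server.exe --list-devices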

load_tensors: offloaded 63/63 layers to GPU
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB
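If you want a different split between the two cards, the standard --tensor-split flag should still apply together with -sm layer; the ratio below is only an illustration that roughly matches the buffer sizes above, not a value from the post:

llama-server.exe ... -sm layer --tensor-split 2,1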

The main advantage is in the prefill (prompt processing) stage.
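To put numbers on that, llama-bench reports prompt processing (pp) and token generation (tg) throughput separately; the prompt and generation lengths here are just examples:

llama-bench.exe -m "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" -p 2048 -n 128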

On Windows:

rmdir /s /q build

cmake -B build -G Ninja ^
  -DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
  -DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
  -DCMAKE_HIP_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
  -DCMAKE_PREFIX_PATH="C:/Program Files/AMD/ROCm/6.4" ^
  -DHIP_ROOT_DIR="C:/Program Files/AMD/ROCm/6.4" ^
  -DGGML_HIP=ON ^
  -DGGML_CUDA=ON ^
  -DGGML_BACKEND_DL=ON ^
  -DGGML_CPU_ALL_VARIANTS=ON ^
  -DGGML_AVX_VNNI=OFF ^
  -DGGML_AVX512=OFF ^
  -DGGML_AVX512_VBMI=OFF ^
  -DGGML_AVX512_VNNI=OFF ^
  -DGGML_AVX512_BF16=OFF ^
  -DGGML_AMX_TILE=OFF ^
  -DGGML_AMX_INT8=OFF ^
  -DGGML_AMX_BF16=OFF ^
  -DCMAKE_CUDA_COMPILER="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/nvcc.exe" ^
  -DCMAKE_CUDA_ARCHITECTURES="120" ^
  -DCMAKE_BUILD_TYPE=Release

___________________

cmake --build build -j

_______________________

Unfortunately, the -DGGML_CPU_ALL_VARIANTS=ON flag causes many compilation errors, so I had to edit the file, for example:

notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt

and comment out (or remove) the line ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)
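In other words, the change amounts to disabling that one variant (its exact position in the file varies between llama.cpp versions):

before: ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)
after:  # ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)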

With a Ryzen 5950X it's fine, since that Intel-specific variant isn't needed anyway.
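Once the build goes through, it's worth checking that both GPU backends were actually produced as loadable modules; with -DGGML_BACKEND_DL=ON they end up as separate DLLs next to the executables (exact file names may differ between versions):

dir build\bin\ggml-*.dll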

Then:

set PATH=C:\Program Files\AMD\ROCm\6.4\bin;%PATH%

llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 91920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1
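Once it's up (no --port is given, so llama-server listens on the default 8080), a quick sanity check against the OpenAI-compatible endpoint could look like this; the prompt is just a placeholder:

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"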

Done.

submitted by /u/LegacyRemaster