| I wanted to share an open-source app that I built for running LLMs locally on my setup.

My setup
Hardware:
Software: Ubuntu 25.10; llama.cpp built from source for CUDA, Vulkan, and ROCm.

How I use this app
I generally run two models in parallel on different llama.cpp backends simultaneously: Qwen 3.6 27B (UD-Q6-KXL or NVFP4) on CUDA, and Qwen 3.6 35B A3B (UD-Q6-KXL) in Strix Halo unified memory. I mostly use them with opencode for coding, where the built-in model router comes in handy.

What else can the app do
It does the basic things any llama.cpp wrapper can do, plus some extras. Overall, it's a convenience app for spinning up llama-server instances for any purpose. And it's open-source.
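The two-model workflow described above can be sketched as two independent llama-server launches, one per backend build. The model paths, ports, and build directories below are illustrative placeholders, not the author's actual configuration:

```shell
# Dense 27B on the CUDA build (RTX Pro), OpenAI-compatible API on port 8080
./build-cuda/bin/llama-server \
  -m ~/models/Qwen3.6-27B-UD-Q6-KXL.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99 &

# 35B A3B MoE on the ROCm build, served from Strix Halo unified memory on port 8081
./build-rocm/bin/llama-server \
  -m ~/models/Qwen3.6-35B-A3B-UD-Q6-KXL.gguf \
  --host 127.0.0.1 --port 8081 -ngl 99 &
```

Each instance exposes its own OpenAI-compatible endpoint, so a coding client such as opencode (or a model router in front of both) can address them independently.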
More info is in the README, along with some guides. It's an early-stage alpha release, so expect some minor bugs; I have mostly fixed the major ones. Feature requests as well as bug reports are welcome.

---

Setting up ROCm on Strix Halo (Ubuntu 25.10)
Strix Halo on Linux needs some setup before ROCm works natively for gfx1151. I am aware of the Docker-based toolboxes for Strix Halo; they work and are a good option, but I wanted bare metal without containers. I am including the steps below for those interested in trying it out.
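For reference, once the ROCm userspace is in place, a HIP build of llama.cpp targeting gfx1151 comes down to a short CMake invocation. This is a minimal sketch of the standard upstream flags; the author's exact steps are not reproduced in this excerpt, and the compiler path may differ per ROCm install:

```shell
# HIP/ROCm build of llama.cpp for Strix Halo (gfx1151)
HIPCXX=/opt/rocm/llvm/bin/clang++ \
cmake -B build-rocm -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build-rocm --config Release -j "$(nproc)"
```

A CUDA build is analogous with -DGGML_CUDA=ON in a separate build tree, which is how two backend binaries can coexist and run side by side.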
Additionally, I build llama.cpp from source for CUDA 13.2 (for the RTX Pro 5000) with the standard ---

PS. Apple Mac: I don't own a Mac, so I am unable to test the app on macOS yet. Feel free to build from source, or share a build with me so I can add it to the releases on GitHub; I can shout out your GitHub handle in the README, thanks :)

[link] [comments] |
Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro.
Reddit r/LocalLLaMA / 5/3/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The author released Warpdrv, an open-source desktop launcher designed to run LLMs locally using llama.cpp and to conveniently manage multiple backend sessions.
- The setup focuses on running Qwen 3.6 27B and 35B in parallel, using CUDA for one model and Strix Halo unified memory with ROCm/Vulkan support for the other.
- Warpdrv includes features such as chat tool calling via MCP.json, a model router for coding-focused workflows, and experimental KV-cache checkpointing.
- The project does not bundle a prebuilt llama.cpp binary, but provides configurable “recipes” (bash scripts with UI) to build backends with one click.
- The post also shares early, bare-metal instructions for getting ROCm working on Strix Halo under Ubuntu 25.10, including kernel, BIOS, and configuration steps.
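The MCP.json mentioned above presumably follows the common Model Context Protocol client convention of an mcpServers map; the exact schema Warpdrv expects is not shown in the post, so the server entry below is a hypothetical example using the reference filesystem server:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/projects"]
    }
  }
}
```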