https://preview.redd.it/nqok3dch7utg1.jpg?width=4096&format=pjpg&auto=webp&s=d5c1d3f5e5c1d8c0ba986726d2bda08212175fec
Hey everyone. I have a Strix Halo miniPC (Minisforum MS-S1 Max). I added an RTX 5070 Ti eGPU to it via OCuLink, ran some tests on how they work together in llama.cpp, and wanted to share some of my findings.
TL;DR of my findings:
- Vulkan's versatility: It's a highly efficient API that lets you stably combine chips from different vendors (like an AMD APU + NVIDIA GPU). The performance drop compared to native CUDA or ROCm is minimal, just about 5–10%.
- The role of OCuLink: The bandwidth of this connection doesn't bottleneck token generation (tg) or prompt processing (pp). The data transferred is tiny. The real latency comes from the fast GPU idling while waiting for the slower APU.
- Amdahl's Law and Tensor Split: Since devices in llama.cpp process layers strictly sequentially (like a relay race), offloading some computations to slower memory causes a non-linear, hyperbolic drop in overall speed. This overall performance degradation for sequential execution is exactly what Amdahl's Law describes.
First, here are the standard llama-bench results for each GPU using their native backends:
~/llama.cpp/build-rocm/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB): Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 1493.28 ± 30.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp2048 | 1350.47 ± 40.94 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp8192 | 958.19 ± 1.85 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 50.16 ± 0.07 |
~/llama.cpp/build-cuda/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15841 MiB): Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15841 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 8476.95 ± 206.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp2048 | 8081.18 ± 27.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp8192 | 6266.69 ± 6.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 179.20 ± 0.13 |
Now, the tests for each GPU using Vulkan:
GGML_VK_VISIBLE_DEVICES=0 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 7466.51 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 7216.51 ± 1.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 6319.98 ± 7.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 167.77 ± 1.56 |
GGML_VK_VISIBLE_DEVICES=1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1327.76 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 1252.70 ± 5.86 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 960.10 ± 2.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 52.29 ± 0.15 |
And the most interesting part: testing both GPUs working together with tensor split via Vulkan. The model weights were distributed between the NVIDIA RTX 5070 Ti VRAM and the AMD Radeon 8060S UMA in the following proportions: 100%/0%, 90%/10%, 80%/20%, 70%/30%, 60%/40%, 50%/50%, 40%/60%, 30%/70%, 20%/80%, 10%/90%, 0%/100%.
GGML_VK_VISIBLE_DEVICES=0,1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -dev vulkan0/vulkan1 -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 -n 128 -p 512 -r 10
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | pp512 | 7461.22 ± 6.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | tg128 | 168.91 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | pp512 | 5790.85 ± 52.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | tg128 | 130.22 ± 0.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | pp512 | 4230.90 ± 28.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | tg128 | 112.66 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | pp512 | 3356.88 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | tg128 | 99.83 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | pp512 | 2658.89 ± 13.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | tg128 | 85.67 ± 2.50 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | pp512 | 2185.28 ± 16.92 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | tg128 | 76.73 ± 1.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | pp512 | 1946.46 ± 19.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | tg128 | 62.84 ± 0.15 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | pp512 | 1644.25 ± 29.88 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | tg128 | 58.38 ± 0.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | pp512 | 1458.99 ± 19.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | tg128 | 55.70 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | pp512 | 1304.67 ± 45.80 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | tg128 | 54.16 ± 1.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | pp512 | 1194.55 ± 5.25 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | tg128 | 52.62 ± 0.72 |
During token generation with split layers, the drop in overall tg and pp speed follows Amdahl's Law. Moving even a small fraction of layers to lower-bandwidth memory creates a bottleneck, leading to a non-linear drop in overall speed (t/s). If you graph it, it forms a classic hyperbola.
https://preview.redd.it/8frnjhri7utg1.jpg?width=1600&format=pjpg&auto=webp&s=2577562f66d60ba572670cea11bad2da588c6256
Formula: P(s) = 100 / [1 + s(k - 1)]
Where:
- P(s) = total system speed (in % of max eGPU speed).
- s = fraction of the model offloaded to the slower APU RAM (from 0 to 1, where 0 is all in VRAM and 1 is all in RAM).
- k = memory bandwidth gap ratio. Calculated as max speed divided by min speed (k = V_max / V_min).
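To sanity-check the formula against the data, here's a minimal Python sketch. It takes the measured tg128 endpoints of the sweep (168.91 and 52.62 t/s) and assumes the layer fraction on the slow device exactly matches the -ts fraction, which is only approximately true since llama.cpp rounds to whole layers:

```python
# Sanity check of the Amdahl-style formula against the measured tg sweep.
# Assumption: the layer split exactly matches the -ts fraction
# (llama.cpp actually rounds to whole layers).

fast = 168.91    # tg128 t/s with everything on the RTX 5070 Ti (ts 10/0)
slow = 52.62     # tg128 t/s with everything on the 8060S APU   (ts 0/10)
k = fast / slow  # speed gap ratio, ~3.21

def predicted_tps(s: float) -> float:
    """Predicted combined t/s with fraction s of the model on the slow device.

    Sequential pipeline: time per token is the sum of the time each device
    spends on its share of the layers. Equivalent to fast / (1 + s*(k-1)).
    """
    return 1.0 / ((1.0 - s) / fast + s / slow)

# Measured tg128 values from the tensor-split sweep above.
measured = {0.1: 130.22, 0.2: 112.66, 0.3: 99.83, 0.4: 85.67, 0.5: 76.73,
            0.6: 62.84, 0.7: 58.38, 0.8: 55.70, 0.9: 54.16}

for s, m in measured.items():
    print(f"s={s:.1f}  predicted {predicted_tps(s):6.1f} t/s  measured {m:6.1f} t/s")
```

The predictions come out a few t/s optimistic across the whole sweep (e.g. ~80 vs the measured 76.73 t/s at the 5/5 split), which is expected: the model ignores per-hop synchronization overhead.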
As you can see, the overall tg and pp speeds depend only on each device's standalone tg and pp figures; the OCuLink link between them doesn't affect the overall speed at all.
Detailed Conclusions & Technical Analysis:
Based on the benchmark data and the architectural specifics of LLMs, here is a deeper breakdown of why we see these results.
1. Vulkan is the Ultimate API for Cross-Vendor Inference
Historically, mixing AMD and NVIDIA chips for compute tasks in a single pipeline has been a driver nightmare. However, llama.cpp's Vulkan backend completely changes the game.
- The Justification: Vulkan abstracts the hardware layer, standardizing the matrix multiplication math across entirely different architectures (RDNA 3.5 on the APU and Blackwell on the RTX 5070 Ti).
- The Result: It allows for seamless, stable pooling of discrete VRAM and system UMA memory. The performance penalty compared to highly optimized, native backends like CUDA or ROCm is practically negligible (only about 5–10%). You lose a tiny fraction of raw speed to the API translation layer, but you gain the massive advantage of fitting larger models across different hardware ecosystems without crashing.
2. The OCuLink Myth: PCIe 4.0 x4 is NOT a Bottleneck for LLMs
There is a widespread belief in the eGPU community that the limited bandwidth of OCuLink (~7.8 GB/s, i.e. 64 Gbps) will throttle AI performance. For LLM inference, this is simply false: link utilization during active generation is around a mere 1%. Here is the math behind why the communication penalty is practically zero (a runnable sketch follows the list):
- Token Generation (Decode Phase): Thanks to the Transformer architecture, the GPUs do not send entire neural networks back and forth. When the model is split across two devices, they only pass a small tensor of hidden states (activations) for a single token at a time. For a 7B or even a 70B model, this payload is roughly a few dozen kilobytes, and sending kilobytes over a 7.8 GB/s link takes on the order of a microsecond.
- Context Processing (Prefill Phase): Even when digesting a massive prompt of 10,000+ tokens, llama.cpp processes the data in chunks (typically 512 tokens at a time). A 512-token chunk translates to just a few megabytes of data transferred across the PCIe bus. Moving 8 MB over OCuLink takes about 1 millisecond. Meanwhile, the GPUs take tens or hundreds of milliseconds to actually compute that chunk.
- The True Bottleneck: System speed is dictated entirely by the Memory Bandwidth of the individual nodes (RTX 5070 Ti at ~900 GB/s vs APU at ~200 GB/s), not the PCIe connection between them. The only scenarios where OCuLink's narrow bus will actually hurt you are the initial loading of the model weights from your SSD/RAM into the eGPU (taking 3–4 seconds instead of 1) or during full fine-tuning, which requires constantly moving massive arrays of gradients.
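To make the arithmetic above concrete, here's the back-of-envelope as a runnable sketch. The fp16 activation width and d_model = 4096 (the hidden size of Llama-2-7B) are assumptions about what crosses the link; only the hidden-state tensor is handed over at the split point:

```python
# Back-of-envelope: how much data actually crosses the OCuLink link.
# Assumptions: fp16 activations, d_model = 4096 (hidden size of Llama-2-7B),
# and only the hidden-state tensor is passed at the split point.

D_MODEL    = 4096      # Llama-2-7B hidden size
BYTES_FP16 = 2
LINK_BW    = 7.8e9     # OCuLink / PCIe 4.0 x4, bytes per second

def hop(n_tokens: int) -> tuple[int, float]:
    """Payload size (bytes) and wire time (seconds) for one device-to-device hop."""
    payload = n_tokens * D_MODEL * BYTES_FP16
    return payload, payload / LINK_BW

for label, n in [("decode, 1 token", 1), ("prefill chunk, 512 tokens", 512)]:
    size, t = hop(n)
    print(f"{label:26s} {size / 1024:8.1f} KiB  ->  {t * 1e6:7.1f} us on the wire")
```

That works out to ~8 KiB (about 1 µs on the wire) per decoded token and ~4 MiB (about 0.5 ms) per 512-token prefill chunk, i.e. noise next to the tens of milliseconds of actual compute.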
3. Amdahl’s Law and the "Relay Race" Pipeline Stalls
When using Tensor Splitting across multiple devices at batch size 1 (standard local inference without micro-batching), llama.cpp executes a strictly sequential pipeline.
- The Justification: Layer 2 cannot be computed until Layer 1 is finished. If you put 80% of the model on the lightning-fast RTX 5070 Ti and 20% on the slower AMD APU, they do not work simultaneously. The RTX processes its layers instantly, passes the tiny activation tensor over OCuLink, and then goes to sleep (Pipeline Stall). It sits completely idle, waiting for the memory-bandwidth-starved APU to grind through its 20% share of the layers.
- The Result: You are not adding compute power; you are adding a slow runner to a relay race. Because the fast GPU is forced to wait, the performance penalty of offloading layers to slower system memory is non-linear. As shown in the data, it perfectly graphs out as a classic hyperbola governed by Amdahl's Law. Moving just 10-20% of the workload to the slower node causes a disproportionately massive drop in total tokens per second (see the sketch below).
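To put a number on the stall, here's a small sketch that estimates, under the same sequential model and using the measured single-device tg rates from above, what fraction of each token the RTX 5070 Ti spends idle:

```python
# How long does the fast GPU sit idle in the "relay race"?
# Same sequential model as the formula above, fed with measured tg rates.

fast, slow = 168.91, 52.62   # tg128 t/s from the single-device runs

for s in (0.1, 0.2, 0.3, 0.5):
    t_fast = (1 - s) / fast   # seconds/token the RTX spends on its layers
    t_slow = s / slow         # seconds/token the APU spends on its layers
    idle = t_slow / (t_fast + t_slow)
    print(f"{s:.0%} of the model on the APU -> RTX idle {idle:.0%} of each token")
```

Offloading just 20% of the model to the APU already leaves the RTX idle for roughly 45% of every token, and at a 50/50 split it idles about 76% of the time.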
System Configuration:
- Base: Minisforum MS-S1 Max (Strix Halo APU, AMD Radeon 8060S iGPU, RDNA 3.5 architecture). Quiet power mode.
- RAM: 128GB LPDDR5X-8000 (iGPU memory bandwidth is ~210 GB/s in practice, theoretical is 256 GB/s).
- OS: CachyOS (Linux 6.19.11-1-cachyos) with the latest Mesa driver (RADV). Booted with GRUB params:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"
(gttsize is in MiB and pages_limit is in 4 KiB pages, so both cap the iGPU-accessible memory at ~124 GiB; that's where the 126976 MiB figure in the ROCm output above comes from.)
eGPU Setup:
- GPU: NVIDIA RTX 5070 Ti
- To get an OCuLink port on the Minisforum MS-S1 Max, I added a PCIe 4.0 x4 to OCuLink SFF8611/8612 adapter.
- Dock: I bought a cheap F9G-BK7 eGPU dock. PSU is a 1STPLAYER NGDP Gold 850W.
- Everything worked right out of the box, zero compatibility issues.