Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions (Part 2)

Reddit r/LocalLLaMA / 4/9/2026


Key Points

The post benchmarks llama.cpp performance on a hybrid setup that splits model layers between an APU (Strix Halo via RADV) and an external GPU (RTX 5070 Ti via OCuLink), using Vulkan backends.
It tests the Qwen3.5-27B-UD-Q4_K_XL model by varying the tensor split ratio in 10% increments, from 100% APU/0% eGPU to 0% APU/100% eGPU.
The author compares measured Prompt Processing (PP) and Token Generation (TG) metrics against performance predictions from a previously published universal estimation formula.
A key takeaway is that while benchmarks were requested by the community, the author argues that the earlier predictive method largely makes extensive reruns unnecessary for similar APU+eGPU/tensor-split configurations.
The overall aim of the follow-up is to clarify the methodology so users can estimate performance for any model using tensor splits more reliably.

https://preview.redd.it/wqk6fh12d0ug1.jpg?width=4096&format=pjpg&auto=webp&s=292562e4000da9239b21ca5dc0e01adcf127f127

Hello everyone! Based on the community's feedback on my previous post, I decided to write this follow-up to clarify and expand on a few things.

Many of you in the comments asked for benchmarks, so I'll start with benchmarks for current models.

I benchmarked Qwen3.5-27B-UD-Q4_K_XL.gguf, distributing the layers (tensor split) between the APU and the eGPU in 10% increments: from 100%/0% to 0%/100%.

Below, I'll show why, in reality, running these benchmarks wasn't strictly necessary. We will compare the actual PP (Prompt Processing) and TG (Token Generation) metrics with the ones predicted by the formula from my first article. The main goal of the previous post was to demonstrate a universal method for estimating the performance of an APU+eGPU setup for any model when using a tensor split. However, judging by the number of questions, I didn't convey this idea clearly enough—so I'm correcting that now!

~/llama.cpp/build-vulkan/bin/llama-bench \
  -m ~/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -fa 1 \
  -dev vulkan1/vulkan0 \
  -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	fa	dev	ts	test	t/s
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	10.00	pp512	268.02 ± 0.46
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	10.00	tg128	11.89 ± 0.03
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	9.00/1.00	pp512	280.95 ± 10.11
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	9.00/1.00	tg128	12.43 ± 0.03
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	8.00/2.00	pp512	267.87 ± 9.95
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	8.00/2.00	tg128	12.89 ± 0.02
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	7.00/3.00	pp512	293.02 ± 2.44
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	7.00/3.00	tg128	13.48 ± 0.13
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	6.00/4.00	pp512	336.32 ± 1.94
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	6.00/4.00	tg128	14.62 ± 0.24
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	5.00/5.00	pp512	377.92 ± 14.46
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	5.00/5.00	tg128	17.20 ± 0.08
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	4.00/6.00	pp512	462.06 ± 3.56
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	4.00/6.00	tg128	19.81 ± 0.08
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	3.00/7.00	pp512	563.40 ± 1.84
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	3.00/7.00	tg128	22.19 ± 0.10
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	2.00/8.00	pp512	757.22 ± 3.64
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	2.00/8.00	tg128	26.05 ± 0.06
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	1.00/9.00	pp512	988.62 ± 5.18
qwen35 27B Q4_K - Medium	16.40 GiB	26.90 B	Vulkan	99	1	Vulkan1/Vulkan0	1.00/9.00	tg128	30.25 ± 0.06

ggml_vulkan: Device memory allocation of size 1067094656 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '~/Qwen3.5-27B-UD-Q4_K_XL.gguf'

The model didn't entirely fit into VRAM, so at 100% VRAM offload, llama-bench crashed with an out-of-memory error.

In the comments, many people rightly asked why I ran tests on the outdated llama-2-7b.Q4_0.gguf. It was a conscious choice, for two reasons:

It's a universal baseline for comparison. Historically, this exact model became the "gold standard" for testing LLM hardware. There is a massive database of results online (for example, in this GitHub thread) for a wide variety of configurations: Apple Silicon, NVIDIA, AMD, APUs, and their backends. By comparing the TG and PP metrics on this Llama, it's easy to understand the performance level of our APU+eGPU combo relative to any other hardware out there.
Calculating the hardware performance constant. On this model, I measured the TG128 and PP512 speeds for each node separately (when the model is loaded entirely on the RTX 5070 Ti or entirely on the Strix Halo). The absolute numbers of the old Llama aren't as important to us—what matters is their ratio. The ratio of GPU speed to APU speed (let's call it the GtA_ratio) is a constant that depends solely on the memory bandwidth and the compute power of the chips themselves. And this constant will be the same for any model.

Here is what it looks like in numbers:

Token Generation (TG128): For the 5070 Ti, it's 168.91 t/s; for the Strix Halo, it's 52.62 t/s. The TG128 GtA_ratio constant = 168.91 / 52.62 = 3.21.
Prompt Processing (PP512): For the 5070 Ti, it's 7461.22 t/s; for the Strix Halo, it's 1194.55 t/s. The PP512 GtA_ratio constant = 7461.22 / 1194.55 = 6.25.

Naturally, if you swap the graphics card for a different one, these constants will change. But knowing them for your current system allows you to predict speeds for any new LLM.
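The constants above are plain divisions, so they are easy to recompute for your own hardware. Here is a minimal sketch using the baseline numbers quoted above; the dictionary layout and the function name `gta_ratio` are illustrative choices, not part of llama.cpp.

```python
# Baseline llama-2-7b.Q4_0 speeds quoted above, in tokens per second.
# Each device was measured with the whole model loaded on it alone.
BASELINE = {
    "tg128": {"egpu": 168.91, "apu": 52.62},
    "pp512": {"egpu": 7461.22, "apu": 1194.55},
}


def gta_ratio(egpu_tps: float, apu_tps: float) -> float:
    """Return the eGPU-to-APU speed ratio for one test type."""
    return egpu_tps / apu_tps


# One constant per test type (TG and PP have different ratios).
ratios = {test: gta_ratio(v["egpu"], v["apu"]) for test, v in BASELINE.items()}

for test, ratio in ratios.items():
    print(f"{test}: GtA_ratio = {ratio:.2f}")
```

To adapt it, replace the four baseline values with the single-device results from your own llama-bench runs.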

In the previous article, I mentioned that the performance drop during Tensor Split follows Amdahl's Law, and the graph of this drop is a hyperbola. For greater clarity, I have slightly adapted the base formula.

Here is what it looks like now:

Perf = [ GtA_ratio / ( 1 + (Share / 100) * (GtA_ratio - 1) ) ] * 100%

Where:

Perf — total system performance (as a percentage relative to the base APU speed).
GtA_ratio — our eGPU-to-APU speed ratio (the constant we calculated earlier).
Share — the percentage of the model offloaded to the slower system memory (APU RAM). It ranges from 0 to 100, where 0 means the entire model fits into the fast eGPU VRAM, and 100 means it runs entirely in the system RAM.
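For those who want to see where the hyperbola comes from, here is a short derivation sketch. It assumes that the time per token is simply the sum of the time spent on each device, that each device's time is proportional to its share of the weights, and that the link transfer time is negligible. The symbols s, R, v_APU and t(s) are shorthand introduced only for this sketch: s = Share / 100 and R = GtA_ratio.

```latex
% Assumptions: additive per-device time, work proportional to weight share,
% link transfer time ignored.
\begin{align*}
  v_{\mathrm{GPU}} &= R \, v_{\mathrm{APU}} \\
  t(s) &= \frac{s}{v_{\mathrm{APU}}} + \frac{1 - s}{R \, v_{\mathrm{APU}}} \\
  \mathrm{Perf} &= \frac{1 / t(s)}{v_{\mathrm{APU}}} \times 100\%
     = \frac{1}{s + \frac{1 - s}{R}} \times 100\%
     = \frac{R}{1 + s \, (R - 1)} \times 100\%
\end{align*}
```

At s = 1 this gives 100% (APU-only speed), and at s = 0 it gives R × 100% (eGPU-only speed), which matches the two ends of the curve.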

Let's plot the overall performance graph based on our baseline llama-2-7b.Q4_0.gguf benchmarks.

https://preview.redd.it/ki4nhgty00ug1.png?width=3000&format=png&auto=webp&s=f5a96195b565d75591545cabe24ac69c14df2377

Now, let's overlay the fresh test results for the current Qwen3.5-27B-UD-Q4_K_XL.gguf model onto this hyperbola.

Just a quick reminder: because the model didn't fully fit into VRAM, the final data point (100% VRAM offload) is missing from the graph.

As you can see, the real Qwen3.5 tests fit our mathematical curve perfectly! This proves the main point: to estimate the system performance for any new model, you don't necessarily have to run benchmarks. It's enough to follow a simple 3-step algorithm:

Calculate the model's "tail": Subtract the GPU VRAM capacity (in my case, 16 GB) from the model file size. This tells us how many gigabytes of weights won't fit in the eGPU and will be sent to the Strix Halo's RAM.
Find the Share percentage: Convert this "tail" into a percentage of the total model weight. The resulting number is our Share value.
Apply the formula: Plug in Share and our GtA_ratio constants to calculate the final speed Perf.
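The three steps above can be sketched in a few lines of Python. The 24 GB model size is a hypothetical example, not a benchmarked model, and the function names are illustrative. The sketch also assumes that all 16 GB of VRAM are available for weights; in practice the KV cache and compute buffers take some of that space, so the real Share will be somewhat higher.

```python
def tail_gb(model_size_gb: float, vram_gb: float) -> float:
    """Step 1: weights that do not fit in eGPU VRAM (never negative)."""
    return max(model_size_gb - vram_gb, 0.0)


def share_percent(model_size_gb: float, vram_gb: float) -> float:
    """Step 2: the tail as a percentage of the whole model (Share)."""
    return tail_gb(model_size_gb, vram_gb) / model_size_gb * 100


def perf_percent(share: float, gta_ratio: float) -> float:
    """Step 3: speed as a percentage of APU-only speed."""
    return gta_ratio / (1 + (share / 100) * (gta_ratio - 1)) * 100


MODEL_SIZE_GB = 24.0  # hypothetical model file size
VRAM_GB = 16.0        # RTX 5070 Ti capacity, assumed fully usable for weights

share = share_percent(MODEL_SIZE_GB, VRAM_GB)
perf_tg = perf_percent(share, 3.21)  # TG128 constant from the baseline
perf_pp = perf_percent(share, 6.25)  # PP512 constant from the baseline

print(f"Share on APU: {share:.1f}%")
print(f"Estimated TG: {perf_tg:.0f}% of APU-only speed")
print(f"Estimated PP: {perf_pp:.0f}% of APU-only speed")
```

Multiply the resulting percentages by your APU-only t/s for the same model to get an absolute estimate.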

For my system (RTX 5070 Ti + Strix Halo), the calculations look like this:

For Token Generation (TG128): GtA_ratio = 3.21. Formula:

Perf_tg128 = [ 3.21 / ( 1 + (Share / 100) * (3.21 - 1) ) ] * 100%

For Prompt Processing (PP512): GtA_ratio = 6.25. Formula:

Perf_pp512 = [ 6.25 / ( 1 + (Share / 100) * (6.25 - 1) ) ] * 100%

Reminder: Perf_tg128 and Perf_pp512 give the operating speed as a percentage of the speed you get when running the model on the APU alone.
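To turn those percentages into tokens per second and check them against the Qwen3.5-27B table above, you can multiply by the APU-only result for the same model. The sketch below assumes that the -ts ratio maps directly onto Share (for example, 7/3 means Share = 70), which is a simplification because llama.cpp splits by layers rather than by exact bytes.

```python
def perf_percent(share: float, gta_ratio: float) -> float:
    """Speed as a percentage of APU-only speed for a given Share."""
    return gta_ratio / (1 + (share / 100) * (gta_ratio - 1)) * 100


TG_RATIO, PP_RATIO = 3.21, 6.25  # constants from the llama-2-7b baseline

# (Share on APU in %, measured tg128 t/s, measured pp512 t/s),
# copied from the Qwen3.5-27B llama-bench table above.
MEASURED = [
    (100, 11.89, 268.02),
    (90, 12.43, 280.95),
    (80, 12.89, 267.87),
    (70, 13.48, 293.02),
    (60, 14.62, 336.32),
    (50, 17.20, 377.92),
    (40, 19.81, 462.06),
    (30, 22.19, 563.40),
    (20, 26.05, 757.22),
    (10, 30.25, 988.62),
]

# The APU-only row (Share = 100) is the base that Perf is relative to.
tg_base, pp_base = MEASURED[0][1], MEASURED[0][2]

rows = []
print("share  tg_meas tg_pred   pp_meas  pp_pred")
for share, tg, pp in MEASURED:
    tg_pred = tg_base * perf_percent(share, TG_RATIO) / 100
    pp_pred = pp_base * perf_percent(share, PP_RATIO) / 100
    rows.append((share, tg, tg_pred, pp, pp_pred))
    print(f"{share:>4}%  {tg:7.2f} {tg_pred:7.2f}  {pp:8.2f} {pp_pred:8.2f}")
```

Printing measured and predicted values side by side shows how far each point sits from the curve on your own system.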

Another hot topic in the comments is the choice of eGPU interface. Many people asked about OCuLink versus Thunderbolt (TB) or USB4. Let's break down the mechanics of the process to clear up all questions.

As I mentioned before, OCuLink is not a bottleneck for either prompt processing (PP) or token generation (TG). To understand why, let's look at what makes up the generation time of a single token when using Tensor Split. It is always the sum of three stages:

Computing the first chunk of layers on the eGPU.
Transmitting the activation tensor (intermediate results) through the cable from the eGPU to the APU.
Computing the remaining layers in the APU's system RAM.

And here lies the most crucial nuance: during the second stage, latency is far more important than bandwidth.

The size of the transmitted activation tensor is relatively small, so the raw bandwidth of any modern interface (whether OCuLink, TB, or USB4) is more than enough with plenty of headroom. They do not saturate the "pipe." But because this transmission cycle repeats for every single generated token, what comes to the forefront is how quickly the signal initializes and travels from point A to point B.
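A back-of-the-envelope estimate illustrates the bandwidth side of this argument. Every constant below except the measured TG speed is an assumption made for illustration: the hidden size is a hypothetical placeholder (not the real value for any model in this post), the activations are assumed to be fp32, the link is assumed to run at PCIe 4.0 x4 theoretical throughput, and the per-transfer latency is a made-up placeholder that you should replace with a measured figure.

```python
HIDDEN_SIZE = 5120          # hypothetical activation width, in values per token
BYTES_PER_VALUE = 4         # assumption: activations sent as fp32
LINK_BYTES_PER_S = 7.877e9  # assumption: PCIe 4.0 x4 theoretical (16 GT/s * 4 lanes * 128/130 / 8)
LINK_LATENCY_S = 10e-6      # hypothetical per-transfer latency placeholder
MEASURED_TG_TPS = 30.25     # tg128 at the 1/9 split, from the table above


def transfer_time_s(n_bytes: int, bytes_per_s: float, latency_s: float) -> float:
    """Simple link model: fixed latency plus size divided by bandwidth."""
    return latency_s + n_bytes / bytes_per_s


activation_bytes = HIDDEN_SIZE * BYTES_PER_VALUE
wire_s = activation_bytes / LINK_BYTES_PER_S  # bandwidth-bound part only
link_s = transfer_time_s(activation_bytes, LINK_BYTES_PER_S, LINK_LATENCY_S)
token_s = 1 / MEASURED_TG_TPS                 # total time budget per token

print(f"Activation size per token: {activation_bytes / 1024:.0f} KiB")
print(f"Wire time:   {wire_s * 1e6:.1f} us")
print(f"Link total:  {link_s * 1e6:.1f} us")
print(f"Token time:  {token_s * 1e3:.1f} ms")
print(f"Link share of token time: {link_s / token_s:.4%}")
```

Under these assumptions the bandwidth-bound part is a small slice of the link time, so the fixed per-transfer cost is the term that differs between interfaces. This model counts a single transfer per token and ignores driver synchronization, so treat it as a lower bound rather than a measurement.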

This is where the main technical difference lies:

OCuLink is essentially a "naked" PCIe bus extension. Data travels directly to the CPU lanes with the lowest possible latency.
Thunderbolt and USB4 are forced to package (encapsulate) the PCIe signal into their own protocol, pass it through a controller, and then unpack it on the other side. This adds overhead and micro-delays to every transaction.

Therefore, if you have a choice of interface for local LLMs, it is highly recommended to use OCuLink.

Finally, as promised, here is the benchmark on my system for the Qwen3.5-122B-A10B-UD-Q4_K_XL model:

~/llama.cpp/build-vulkan/bin/llama-bench \
  -m ~/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf \
  -ngl 99 \
  -fa 1 \
  -dev vulkan1/vulkan0 \
  -ts 100/0,95/5,90/10,85/15,80/20,75/25,70/30

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	fa	dev	ts	test	t/s
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	100.00	pp512	247.59 ± 5.96
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	100.00	tg128	19.46 ± 0.26
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	95.00/5.00	pp512	270.07 ± 2.77
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	95.00/5.00	tg128	19.91 ± 0.63
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	90.00/10.00	pp512	281.56 ± 12.32
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	90.00/10.00	tg128	20.40 ± 0.39
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	85.00/15.00	pp512	295.46 ± 16.68
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	85.00/15.00	tg128	20.75 ± 0.57
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	80.00/20.00	pp512	311.33 ± 2.39
qwen35moe 122B.A10B Q4_K - Medium	71.73 GiB	122.11 B	Vulkan	99	1	Vulkan1/Vulkan0	80.00/20.00	tg128	21.79 ± 0.46

ggml_vulkan: Device memory allocation of size 650418176 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '~/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf'

As you can see, because only a small fraction of the model (up to 20%) fit into the VRAM, the overall TG and PP speeds increased only slightly. Specifically, Token Generation (TG) went up by just ~12% (from 19.46 to 21.79 t/s), and Prompt Processing (PP) increased by ~25.7% (from 247.59 to 311.33 t/s).

For very large models, the performance uplift is limited simply because the eGPU's VRAM is usually much smaller than the system RAM available on the Strix Halo, so most of the weights stay on the slower side.
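The uplift figures for the 122B model can be reproduced, and set next to the formula's estimate, with a short sketch. It assumes Share = 80 for the 80/20 split and reuses the constants from the llama-2-7b baseline. Whether those constants carry over unchanged to a mixture-of-experts model is an open assumption here, so treat the predicted values as rough.

```python
def perf_percent(share: float, gta_ratio: float) -> float:
    """Speed as a percentage of APU-only speed for a given Share."""
    return gta_ratio / (1 + (share / 100) * (gta_ratio - 1)) * 100


def uplift_percent(base_tps: float, new_tps: float) -> float:
    """Relative gain of new_tps over base_tps, in percent."""
    return (new_tps - base_tps) / base_tps * 100


# Measured Qwen3.5-122B-A10B values from the table above (100/0 vs 80/20).
tg_uplift = uplift_percent(19.46, 21.79)
pp_uplift = uplift_percent(247.59, 311.33)

# Formula estimate with 80% of the model left on the APU.
tg_pred_uplift = perf_percent(80, 3.21) - 100
pp_pred_uplift = perf_percent(80, 6.25) - 100

print(f"TG uplift: measured {tg_uplift:.1f}%, formula {tg_pred_uplift:.1f}%")
print(f"PP uplift: measured {pp_uplift:.1f}%, formula {pp_pred_uplift:.1f}%")
```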

submitted by /u/xspider2000