https://preview.redd.it/nqok3dch7utg1.jpg?width=4096&format=pjpg&auto=webp&s=d5c1d3f5e5c1d8c0ba986726d2bda08212175fec
Hey everyone. I have a Strix Halo miniPC (Minisforum MS-S1 Max). I added an RTX 5070 Ti eGPU to it via OCuLink, ran some tests on how they work together in llama.cpp, and wanted to share some of my findings.
TL;DR of my findings:
- Vulkan's versatility: It's a highly efficient API that lets you stably combine chips from different vendors (like an AMD APU + NVIDIA GPU). The performance drop compared to native CUDA or ROCm is minimal, just about 5–10%.
- The role of OCuLink: The bandwidth of this connection doesn't bottleneck token generation (tg) or prompt processing (pp). The data transferred is tiny. The real latency comes from the fast GPU idling while waiting for the slower APU.
- Amdahl's Law and Tensor Split: Since devices in llama.cpp process layers strictly sequentially (like a relay race), offloading some computations to slower memory causes a non-linear, hyperbolic drop in overall speed. This overall performance degradation for sequential execution is exactly what Amdahl's Law describes.
First, here are the standard llama-bench results for each GPU using their native backends:
~/llama.cpp/build-rocm/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB): Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 1493.28 ± 30.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp2048 | 1350.47 ± 40.94 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp8192 | 958.19 ± 1.85 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 50.16 ± 0.07 |
~/llama.cpp/build-cuda/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15841 MiB): Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15841 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 8476.95 ± 206.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp2048 | 8081.18 ± 27.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp8192 | 6266.69 ± 6.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 179.20 ± 0.13 |
Now, the tests for each GPU using Vulkan:
GGML_VK_VISIBLE_DEVICES=0 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 7466.51 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 7216.51 ± 1.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 6319.98 ± 7.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 167.77 ± 1.56 |
GGML_VK_VISIBLE_DEVICES=1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1327.76 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 1252.70 ± 5.86 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 960.10 ± 2.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 52.29 ± 0.15 |
And the most interesting part: testing both GPUs working together with tensor split via Vulkan. The model weights were distributed between the NVIDIA RTX 5070 Ti VRAM and the AMD Radeon 8060S UMA in the following proportions: 100%/0%, 90%/10%, 80%/20%, 70%/30%, 60%/40%, 50%/50%, 40%/60%, 30%/70%, 20%/80%, 10%/90%, 0%/100%.
GGML_VK_VISIBLE_DEVICES=0,1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -dev vulkan0/vulkan1 -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 -n 128 -p 512 -r 10
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | pp512 | 7461.22 ± 6.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | tg128 | 168.91 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | pp512 | 5790.85 ± 52.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | tg128 | 130.22 ± 0.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | pp512 | 4230.90 ± 28.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | tg128 | 112.66 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | pp512 | 3356.88 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | tg128 | 99.83 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | pp512 | 2658.89 ± 13.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | tg128 | 85.67 ± 2.50 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | pp512 | 2185.28 ± 16.92 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | tg128 | 76.73 ± 1.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | pp512 | 1946.46 ± 19.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | tg128 | 62.84 ± 0.15 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | pp512 | 1644.25 ± 29.88 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | tg128 | 58.38 ± 0.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | pp512 | 1458.99 ± 19.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | tg128 | 55.70 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | pp512 | 1304.67 ± 45.80 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | tg128 | 54.16 ± 1.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | pp512 | 1194.55 ± 5.25 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | tg128 | 52.62 ± 0.72 |
During token generation with split layers, the drop in overall tg and pp speed follows Amdahl's Law. Moving even a small fraction of layers to lower-bandwidth memory creates a bottleneck, leading to a non-linear drop in overall speed (t/s). If you graph it, it forms a classic hyperbola.
https://preview.redd.it/8frnjhri7utg1.jpg?width=1600&format=pjpg&auto=webp&s=2577562f66d60ba572670cea11bad2da588c6256
Formula: P(s) = 100 / [1 + s(k - 1)]
Where:
- P(s) = total system speed (in % of max eGPU speed).
- s = fraction of the model offloaded to the slower APU RAM (from 0 to 1, where 0 is all in VRAM and 1 is all in RAM).
- k = memory bandwidth gap ratio. Calculated as max speed divided by min speed (k = V_max / V_min).
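To sanity-check the formula against the data, here's a minimal Python sketch. It takes the measured tg128 endpoints of the sweep (168.91 and 52.62 t/s) and assumes the layer fraction on the slow device exactly matches the -ts fraction, which is only approximately true since llama.cpp rounds to whole layers:

```python
# Sanity check of the Amdahl-style formula against the measured tg sweep.
# Assumption: the layer split exactly matches the -ts fraction
# (llama.cpp actually rounds to whole layers).

fast = 168.91    # tg128 t/s with everything on the RTX 5070 Ti (ts 10/0)
slow = 52.62     # tg128 t/s with everything on the 8060S APU   (ts 0/10)
k = fast / slow  # speed gap ratio, ~3.21

def predicted_tps(s: float) -> float:
    """Predicted combined t/s with fraction s of the model on the slow device.

    Sequential pipeline: time per token is the sum of the time each device
    spends on its share of the layers. Equivalent to fast / (1 + s*(k-1)).
    """
    return 1.0 / ((1.0 - s) / fast + s / slow)

# Measured tg128 values from the tensor-split sweep above.
measured = {0.1: 130.22, 0.2: 112.66, 0.3: 99.83, 0.4: 85.67, 0.5: 76.73,
            0.6: 62.84, 0.7: 58.38, 0.8: 55.70, 0.9: 54.16}

for s, m in measured.items():
    print(f"s={s:.1f}  predicted {predicted_tps(s):6.1f} t/s  measured {m:6.1f} t/s")
```

The predictions come out a few t/s optimistic across the whole sweep (e.g. ~80 vs the measured 76.73 t/s at the 5/5 split), which is expected: the model ignores per-hop synchronization overhead.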
As you can see, the overall tg and pp speeds depend only on each device's standalone tg and pp figures; the OCuLink link between them doesn't affect the overall speed at all.
Detailed Conclusions & Technical Analysis:
Based on the benchmark data and the architectural specifics of LLMs, here is a deeper breakdown of why we see these results.
1. Vulkan is the Ultimate API for Cross-Vendor Inference
Historically, mixing AMD and NVIDIA chips for compute tasks in a single pipeline has been a driver nightmare. However, llama.cpp's Vulkan backend completely changes the game.
- The Justification: Vulkan abstracts the hardware layer, standardizing the matrix multiplication math across entirely different architectures (RDNA 3.5 on the APU and Blackwell on the RTX 5070 Ti).
- The Result: It allows for seamless, stable pooling of discrete VRAM and system UMA memory. The performance penalty compared to highly optimized, native backends like CUDA or ROCm is practically negligible (only about 5–10%). You lose a tiny fraction of raw speed to the API translation layer, but you gain the massive advantage of fitting larger models across different hardware ecosystems without crashing.
2. The OCuLink Myth: PCIe 4.0 x4 is NOT a Bottleneck for LLMs
There is a widespread belief in the eGPU community that the limited bandwidth of OCuLink (~7.8 GB/s, i.e. 64 Gbps) will throttle AI performance. For LLM inference, this is simply false: link utilization during active generation is around a mere 1%. Here is the math behind why the communication penalty is practically zero (a runnable sketch follows the list):
- Token Generation (Decode Phase): Thanks to the Transformer architecture, the GPUs do not send entire neural networks back and forth. When the model is split across two devices, they only pass a small tensor of hidden states (activations) for a single token at a time. For a 7B or even a 70B model, this payload is roughly a few dozen kilobytes, and sending kilobytes over a 7.8 GB/s link takes on the order of a microsecond.
- Context Processing (Prefill Phase): Even when digesting a massive prompt of 10,000+ tokens, llama.cpp processes the data in chunks (typically 512 tokens at a time). A 512-token chunk translates to just a few megabytes of data transferred across the PCIe bus. Moving 8 MB over OCuLink takes about 1 millisecond. Meanwhile, the GPUs take tens or hundreds of milliseconds to actually compute that chunk.
- The True Bottleneck: System speed is dictated entirely by the Memory Bandwidth of the individual nodes (RTX 5070 Ti at ~900 GB/s vs APU at ~200 GB/s), not the PCIe connection between them. The only scenarios where OCuLink's narrow bus will actually hurt you are the initial loading of the model weights from your SSD/RAM into the eGPU (taking 3–4 seconds instead of 1) or during full fine-tuning, which requires constantly moving massive arrays of gradients.
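To make the arithmetic above concrete, here's the back-of-envelope as a runnable sketch. The fp16 activation width and d_model = 4096 (the hidden size of Llama-2-7B) are assumptions about what crosses the link; only the hidden-state tensor is handed over at the split point:

```python
# Back-of-envelope: how much data actually crosses the OCuLink link.
# Assumptions: fp16 activations, d_model = 4096 (hidden size of Llama-2-7B),
# and only the hidden-state tensor is passed at the split point.

D_MODEL    = 4096      # Llama-2-7B hidden size
BYTES_FP16 = 2
LINK_BW    = 7.8e9     # OCuLink / PCIe 4.0 x4, bytes per second

def hop(n_tokens: int) -> tuple[int, float]:
    """Payload size (bytes) and wire time (seconds) for one device-to-device hop."""
    payload = n_tokens * D_MODEL * BYTES_FP16
    return payload, payload / LINK_BW

for label, n in [("decode, 1 token", 1), ("prefill chunk, 512 tokens", 512)]:
    size, t = hop(n)
    print(f"{label:26s} {size / 1024:8.1f} KiB  ->  {t * 1e6:7.1f} us on the wire")
```

That works out to ~8 KiB (about 1 µs on the wire) per decoded token and ~4 MiB (about 0.5 ms) per 512-token prefill chunk, i.e. noise next to the tens of milliseconds of actual compute.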
3. Amdahl’s Law and the "Relay Race" Pipeline Stalls
When using Tensor Splitting across multiple devices at batch size 1 (standard local inference without micro-batching), llama.cpp executes a strictly sequential pipeline.
- The Justification: Layer 2 cannot be computed until Layer 1 is finished. If you put 80% of the model on the lightning-fast RTX 5070 Ti and 20% on the slower AMD APU, they do not work simultaneously. The RTX processes its layers instantly, passes the tiny activation tensor over OCuLink, and then goes to sleep (Pipeline Stall). It sits completely idle, waiting for the memory-bandwidth-starved APU to grind through its 20% share of the layers.
- The Result: You are not adding compute power; you are adding a slow runner to a relay race. Because the fast GPU is forced to wait, the performance penalty of offloading layers to slower system memory is non-linear. As shown in the data, it perfectly graphs out as a classic hyperbola governed by Amdahl's Law. Moving just 10-20% of the workload to the slower node causes a disproportionately massive drop in total tokens per second (see the sketch below).
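To put a number on the stall, here's a small sketch that estimates, under the same sequential model and using the measured single-device tg rates from above, what fraction of each token the RTX 5070 Ti spends idle:

```python
# How long does the fast GPU sit idle in the "relay race"?
# Same sequential model as the formula above, fed with measured tg rates.

fast, slow = 168.91, 52.62   # tg128 t/s from the single-device runs

for s in (0.1, 0.2, 0.3, 0.5):
    t_fast = (1 - s) / fast   # seconds/token the RTX spends on its layers
    t_slow = s / slow         # seconds/token the APU spends on its layers
    idle = t_slow / (t_fast + t_slow)
    print(f"{s:.0%} of the model on the APU -> RTX idle {idle:.0%} of each token")
```

Offloading just 20% of the model to the APU already leaves the RTX idle for roughly 45% of every token, and at a 50/50 split it idles about 76% of the time.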
System Configuration:
- Base: Minisforum MS-S1 Max (Strix Halo APU, AMD Radeon 8060S iGPU, RDNA 3.5 architecture). Quiet power mode.
- RAM: 128GB LPDDR5X-8000 (iGPU memory bandwidth is ~210 GB/s in practice, theoretical is 256 GB/s).
- OS: CachyOS (Linux 6.19.11-1-cachyos) with the latest Mesa driver (RADV). Booted with GRUB params:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"
(gttsize is in MiB and pages_limit is in 4 KiB pages, so both cap the iGPU-accessible memory at ~124 GiB; that's where the 126976 MiB figure in the ROCm output above comes from.)
eGPU Setup:
- GPU: NVIDIA RTX 5070 Ti
- To get an OCuLink port on the Minisforum MS-S1 Max, I added a PCIe 4.0 x4 to OCuLink SFF8611/8612 adapter.
- Dock: I bought a cheap F9G-BK7 eGPU dock. PSU is a 1STPLAYER NGDP Gold 850W.
- Everything worked right out of the box, zero compatibility issues.