AI Navigate

Running Qwen3.5-27B Q5 split across a 4070 Ti and an AMD RX6800 over LAN @ 13 t/s with a 32k prompt

Reddit r/LocalLLaMA / 3/15/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author demonstrates splitting Qwen3.5-27B across a 4070 Ti and an AMD RX6800 over LAN using rpc-server, achieving around 13 t/s with a 32k prompt.
  • The setup consists of a Windows 11 machine (4070 Ti, i7-14700K, 64 GB) and an Ubuntu 24.04 machine (RX6800, 48 GB) running llamacpp build b8348, with a caching feature that transfers the model once and serves cached tensors thereafter.
  • Caching speeds subsequent runs by avoiding repeated network transfers, and keeps the KV context on the GPU up to about 84k tokens to minimize sysmem usage, enabling split work across agents.
  • Performance metrics show a prompt evaluation time of 126,132 ms for 33,386 tokens, an eval time of 10,326 ms for 134 tokens, totaling 136,458 ms for 33,520 tokens.
  • The poster is highly satisfied, notes minimal network traffic after the first load, and asks for suggestions to further improve performance while planning to test tool accuracy next.

I don't know why I haven't seen the rpc-server thing before. But what a gamechanger!

I've been using smaller models for a while now, because I'm GPU poor. A 27B dense model has been out of the question at any kind of reasonable speed.

I love the qwen3.5 family. I love everyone who has ever contributed to llamacpp. I love unsloth. And everyone else! :D

My setup is a 12 GB 4070 Ti with an i7-14700K and 64 GB DDR4-3600 in one computer, and a 16 GB AMD RX6800 with an i5-11600K and 48 GB DDR4-3200 in the other.

The 4070 Ti computer runs Windows 11, and the RX6800 computer runs Ubuntu 24.04 with ROCm 7.2; both are on llama.cpp build b8348.

My command on computer 2:
./rpc-server --host 0.0.0.0 -p 50052 -c
The caching feature is golden. The first time a model is loaded it takes a minute or two to transfer it over the network; subsequent runs load the cached tensors directly from disk. Blazing fast.

Then on main computer:
.\llama-server.exe -m D:\LLMs\unsloth\qwen3.5-27b-gguf\Qwen3.5-27B-UD-Q5_K_XL.gguf -c 84000 -ngl 99 --rpc 192.168.10.230:50052 --tensor-split 64,36 -t 8 --flash-attn on -ctk f16 -ctv f16 --parallel 1 --reasoning on --temp 0.7 --top-p 0.9 --min-p 0.05 --top-k 20 --repeat-penalty 1.1 --repeat-last-n 64
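For a rough sense of how `--tensor-split 64,36` divides the weights between the two cards, here's a back-of-the-envelope calculation. The ~19 GB file size for the Q5_K_XL quant is my own estimate, not from the post, and which share lands on which GPU depends on llama.cpp's device ordering:

```shell
# Illustrative split of the model weights implied by --tensor-split 64,36.
# model_gb is an assumed size for a 27B Q5_K_XL GGUF, not a measured value.
awk 'BEGIN {
  model_gb = 19.0
  printf "share 64: %.1f GB, share 36: %.1f GB\n", model_gb * 0.64, model_gb * 0.36
}'
```

Note that the larger ~12 GB share would barely fit on either card next to the KV cache, which suggests the 16 GB RX6800 is taking the bigger slice here.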

Used opencode to fix an existing codebase, to see how it would handle a decently large prompt:

prompt eval time = 126132.09 ms / 33386 tokens ( 3.78 ms per token, 264.69 tokens per second)

eval time = 10325.83 ms / 134 tokens ( 77.06 ms per token, 12.98 tokens per second)

total time = 136457.92 ms / 33520 tokens

slot release: id 0 | task 0 | stop processing: n_tokens = 33519, truncated = 0
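Those timing lines are internally consistent; the tokens-per-second figures can be re-derived from the raw millisecond and token counts:

```shell
# Re-derive throughput from the raw timings in the server log above.
awk 'BEGIN {
  printf "prompt eval: %.2f t/s\n", 33386 / (126132.09 / 1000)
  printf "generation:  %.2f t/s\n", 134 / (10325.83 / 1000)
}'
```

This reproduces the logged 264.69 t/s prompt processing and 12.98 t/s generation.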

I could not be happier. This is far beyond my expectations: all layers on GPU, full KV cache on GPU. Hardly any traffic needs to travel the network apart from loading the model the first time; subsequent loads of the same model are blazing fast.

84k context seems to be the maximum that keeps the KV cache in VRAM without any sysmem usage. But I can definitely work with that, splitting up work between agents.
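An 84k ceiling lines up with a standard f16 KV-cache estimate: per-token size is 2 (K and V) × n_layers × n_kv_heads × head_dim × 2 bytes. The layer and head numbers below are illustrative guesses, since the post doesn't state Qwen3.5-27B's architecture:

```shell
# f16 KV cache estimate: 2 (K+V) * layers * kv_heads * head_dim * 2 bytes/elem.
# layers, kv_heads, and head_dim are assumed values, not confirmed model specs.
awk 'BEGIN {
  layers = 50; kv_heads = 4; head_dim = 128; ctx = 84000
  per_tok = 2 * layers * kv_heads * head_dim * 2
  printf "%d KiB per token, %.1f GiB at %d tokens\n", per_tok / 1024, per_tok * ctx / 2^30, ctx
}'
```

Whatever the true head counts, fitting a KV cache in this ballpark alongside the quantized weights on 12 GB + 16 GB of VRAM is why the context tops out somewhere around 84k.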

If anyone has any suggestions on anything I can do to improve this even further, don't hesitate to tell me!
Will test tool accuracy tomorrow. But I got high hopes :)

submitted by /u/technot80