I don't know why I haven't come across llama.cpp's rpc-server before. But what a game changer!
I've been using smaller models for a while now because I'm GPU poor. A 27B dense model has been out of the question at any kind of reasonable speed.
I love the Qwen3.5 family. I love everyone who has ever contributed to llama.cpp. I love Unsloth. And everyone else! :D
My setup is a 12 GB RTX 4070 Ti and an i7-14700K with 64 GB DDR4-3600 in one computer, and a 16 GB AMD RX 6800, an i5-11600K and 48 GB DDR4-3200 in the other.
The 4070 Ti computer runs Windows 11, and the RX 6800 computer runs Ubuntu 24.04 with ROCm 7.2; both are on build b8348 of llama.cpp.
My command on computer 2:
./rpc-server --host 0.0.0.0 -p 50052 -c
The caching feature (the -c flag) is golden. The first time a model is loaded it takes a minute or two to transfer it over the network; subsequent runs load the cached tensors straight from the server's disk. Blazing fast.
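If you want to see or clear what's been cached on the server side, it lands on that machine's disk. On my build I believe it ends up under the user cache directory, but treat the exact path below as an assumption and verify it on your own setup:

du -sh ~/.cache/llama.cpp/rpc    # size of cached tensors (path is an assumption, check your build)
rm -rf ~/.cache/llama.cpp/rpc    # wipe the cache if you need the disk space back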
Then on the main computer:
.\llama-server.exe -m D:\LLMs\unsloth\qwen3.5-27b-gguf\Qwen3.5-27B-UD-Q5_K_XL.gguf -c 84000 -ngl 99 --rpc 192.168.10.230:50052 --tensor-split 64,36 -t 8 --flash-attn on -ctk f16 -ctv f16 --parallel 1 --reasoning on --temp 0.7 --top-p 0.9 --min-p 0.05 --top-k 20 --repeat-penalty 1.1 --repeat-last-n 64
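If llama-server can't see the RPC backend, a quick sanity check from the Windows box is to confirm the port is actually reachable (PowerShell; IP and port are the ones from my setup above):

Test-NetConnection 192.168.10.230 -Port 50052    # should report TcpTestSucceeded : True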
I used opencode to fix an existing codebase, to see how it would handle a reasonably large prompt:
prompt eval time = 126132.09 ms / 33386 tokens ( 3.78 ms per token, 264.69 tokens per second)
eval time = 10325.83 ms / 134 tokens ( 77.06 ms per token, 12.98 tokens per second)
total time = 136457.92 ms / 33520 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 33519, truncated = 0
I could not be happier. This is far beyond my expectations: all layers on GPU, full KV cache on GPU, and hardly any traffic needs to travel the network apart from loading the model the first time. Subsequent loads of the same model are blazing fast.
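If you want to verify where everything actually lands, watching VRAM on both cards while the model loads works fine with stock tools:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2    # on the Windows / 4070 Ti box
watch -n 2 rocm-smi --showmeminfo vram                               # on the Ubuntu / RX 6800 box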
84k context seems to be the maximum that keeps the KV cache in VRAM without any system memory usage, but I can definitely work with that by splitting up work between agents.
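Rough math for anyone sizing their own context: f16 KV cost per token is roughly 2 (K and V) x n_layers x n_kv_heads x head_dim x 2 bytes. The numbers below are made-up placeholders, not the real Qwen3.5-27B values; llama-server prints the actual layer/head counts and KV size at startup, so plug those in instead:

echo $(( 2 * 48 * 4 * 128 * 2 * 84000 / 1024 / 1024 ))    # approximate KV cost in MiB for 84k context with placeholder numbers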
If anyone has suggestions on anything I can do to improve this even further, don't hesitate to tell me!
I'll test tool accuracy tomorrow, but I have high hopes :)




