Setup:
- CPU: AMD Ryzen 5 9600X
- RAM: 64GB DDR5
- GPU1 (host): RTX 5060ti 16GB
- GPU2 (VM passthrough → RPC): GTX 1080ti 11GB
- OS: Ubuntu 24.04
Exact models:
- unsloth/Qwen3.5-35B-A3B-GGUF (the Q4_K_M quant)
- unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF (the UD-Q4_K_M quant)
tl;dr, with my setup:
- Qwen3.5-35B-A3B Q4_K_M runs at 60 tok/s
- Nemotron-3-Super-120B-A12B UD-Q4_K_M runs at 3 tok/s
I've had a GTX 1080ti for years and years and finally hit a wall with models that require newer non-Pascal architecture, so I decided to upgrade to a 5060ti. I went to install the card when I thought... could I lash these together for a total of 27GB VRAM?? It turned out that, yes, I could, and quite effectively so.
Qwen3.5-35B-A3B
This was my first goal - it would prove that I could actually do what I wanted.
I tried a naive multi-GPU setup with llama.cpp and hit my first challenge: drivers. As far as I could tell, the 5060 Ti requires 290-open or newer, while the 1080 Ti requires 280-closed or older. ChatGPT gave me a red herring about a single driver that might support both, but it was a dead end. What actually worked sounds much crazier, but made sense after the fact.
What ended up working was using virt-manager to create a VM and enabling passthrough such that the host no longer saw my 1080ti and it was exclusive to the guest VM. That allowed me to install proper drivers on each machine. Then I was led to take advantage of llama.cpp's wonderful RPC functionality to let things "just work". And they did. 60t/s was very nice and usable. I didn't expect that speed at all.
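For anyone reproducing the passthrough step, this is roughly what handing the 1080 Ti to the VM looks like on my distro (a sketch, not a full guide; the vendor:device IDs below are the standard GTX 1080 Ti GPU and HDMI-audio IDs, but substitute whatever lspci reports for your card):

```shell
# Find the card's PCI address and vendor:device IDs
lspci -nn | grep -i nvidia
# Reserve both functions (GPU + audio) for VFIO at boot; edit /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=on iommu=pt vfio-pci.ids=10de:1b06,10de:10ef"
sudo update-grub
sudo reboot
# After reboot the host no longer binds the card; in virt-manager,
# attach it via "Add Hardware -> PCI Host Device"
```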
Note that if you try this, you need to build llama.cpp with -DGGML_CUDA=ON and -DGGML_RPC=ON (on both the host and the guest VM).
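For reference, the build looks something like this (standard llama.cpp CMake invocation; adjust the -j value to your core count):

```shell
# Clone and build llama.cpp with the CUDA backend and RPC support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j 12
```

This produces both the rpc-server binary (for the guest) and llama-cli/llama-server (for the host) under build/bin.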
Run the RPC server in the guest VM with: ./build/bin/rpc-server --device CUDA0 --host 0.0.0.0 -p 50052
On the host, get the guest VM's IP by running hostname -I inside the guest, then: ./build/bin/llama-cli -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 999 --rpc the_ip_you_got:50052 --tensor-split 5,8 -p "Say hello in one sentence."
or run as a server with: ./build/bin/llama-server -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 999 --rpc the_ip_you_got:50052 --tensor-split 5,8 --port 8080 --host 0.0.0.0
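Once llama-server is up, any OpenAI-compatible client can talk to it; a quick smoke test with curl against the port used above:

```shell
# Hit llama-server's OpenAI-compatible chat endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```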
Nemotron-3-Super-120B-A12B
The setup above worked without any further changes besides rebuilding llama.cpp and lowering -ngl so that the layers that don't fit in VRAM stay in system RAM.
Note that it took several minutes to load, and free -h reported the memory the model was occupying as available; since llama.cpp mmaps the model by default, those pages count as reclaimable page cache rather than used memory. I also had some intermittent display freezing/unresponsiveness while inference was happening, but it didn't make things unusable.
This worked to check actual memory usage: grep -E 'MemAvailable|MemFree|SwapTotal|SwapFree|Cached|SReclaimable|Shmem|AnonPages|Mapped|Unevictable|Mlocked' /proc/meminfo
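The key numbers can also be summarized in one line; the point is that mmap'd model pages show up under Cached, which is why MemAvailable looks deceptively high:

```shell
# MemAvailable counts reclaimable page cache, so an mmap'd model inflates it
awk '/^MemAvailable:/ {a=$2} /^Cached:/ {c=$2} END {printf "MemAvailable: %d MiB (page cache: %d MiB)\n", a/1024, c/1024}' /proc/meminfo
```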
./build/bin/llama-cli -m ~/models/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_M-00001-of-00003.gguf -ngl 20 --rpc the_ip_you_got_earlier:50052 --tensor-split 5,8 -p "Say hello in one sentence."
I still need to read the guide at https://unsloth.ai/docs/models/nemotron-3-super to see what I can make faster if anything.
Does anyone have any insight as to whether or not I can squeeze unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 into my setup? Can weights be dequantized and offloaded to my 1080ti on the fly?
And AI assistants constantly say my tensor-split is backwards, but things OOM when I flip it, so... anyone know anything about that?
I'm happy to answer any questions and I'd welcome any critique on my approach or commands above. If there's much interest I'll try to put together a more in-depth guide.