Setup:
- CPU: AMD Ryzen 5 9600X
- RAM: 64GB DDR5
- GPU1 (host): RTX 5060ti 16GB
- GPU2 (VM passthrough → RPC): GTX 1080ti 11GB
- OS: Ubuntu 24.04
Exact models:
- unsloth/Qwen3.5-35B-A3B-GGUF (the Q4_K_M quant)
- unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF (the UD-Q4_K_M quant)
tl;dr, with my setup:
- Qwen3.5-35B-A3B Q4_K_M runs at 60 tok/s
- Nemotron-3-Super-120B-A12B UD-Q4_K_M runs at 3 tok/s
I've had a GTX 1080ti for years and years and finally hit a wall with models that require newer non-Pascal architecture, so I decided to upgrade to a 5060ti. I went to install the card when I thought... could I lash these together for a total of 27GB VRAM?? It turned out that, yes, I could, and quite effectively so.
Qwen3.5-35B-A3B
This was my first goal - it would prove that I could actually do what I wanted.
I tried a naive multi-GPU setup with llama.cpp and hit my first challenge: drivers. As far as I could tell, the 5060 Ti requires 290-open or newer, while the 1080 Ti requires 280-closed or older. ChatGPT gave me a red herring about a single driver that might support both, but it was a dead end. What actually worked sounds much crazier, but made sense after the fact.
What ended up working was using virt-manager to create a VM and enabling passthrough such that the host no longer saw my 1080ti and it was exclusive to the guest VM. That allowed me to install proper drivers on each machine. Then I was led to take advantage of llama.cpp's wonderful RPC functionality to let things "just work". And they did. 60t/s was very nice and usable. I didn't expect that speed at all.
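For anyone reproducing the passthrough step, this is roughly what handing the 1080 Ti to the VM looks like on my distro (a sketch, not a full guide; the vendor:device IDs below are the standard GTX 1080 Ti GPU and HDMI-audio IDs, but substitute whatever lspci reports for your card):

```shell
# Find the card's PCI address and vendor:device IDs
lspci -nn | grep -i nvidia
# Reserve both functions (GPU + audio) for VFIO at boot; edit /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=on iommu=pt vfio-pci.ids=10de:1b06,10de:10ef"
sudo update-grub
sudo reboot
# After reboot the host no longer binds the card; in virt-manager,
# attach it via "Add Hardware -> PCI Host Device"
```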
Note that if you try this, you need to build llama.cpp with -DGGML_CUDA=ON and -DGGML_RPC=ON (on both the host and the guest VM).
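For reference, the build looks something like this (standard llama.cpp CMake invocation; adjust the -j value to your core count):

```shell
# Clone and build llama.cpp with the CUDA backend and RPC support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j 12
```

This produces both the rpc-server binary (for the guest) and llama-cli/llama-server (for the host) under build/bin.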
Run the RPC server in the guest VM with: ./build/bin/rpc-server --device CUDA0 --host 0.0.0.0 -p 50052
On the host, get the guest VM's IP by running hostname -I inside the guest, then: ./build/bin/llama-cli -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 999 --rpc the_ip_you_got:50052 --tensor-split 5,8 -p "Say hello in one sentence."
or run as a server with: ./build/bin/llama-server -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 999 --rpc the_ip_you_got:50052 --tensor-split 5,8 --port 8080 --host 0.0.0.0
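Once llama-server is up, any OpenAI-compatible client can talk to it; a quick smoke test with curl against the port used above:

```shell
# Hit llama-server's OpenAI-compatible chat endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```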
Nemotron-3-Super-120B-A12B
The setup above worked without any further changes besides rebuilding llama.cpp and lowering -ngl so that the layers that don't fit in VRAM stay in system RAM.
Note that it took several minutes to load, and free -h reported the memory the model was occupying as available; since llama.cpp mmaps the model by default, those pages count as reclaimable page cache rather than used memory. I also had some intermittent display freezing/unresponsiveness while inference was happening, but it didn't make things unusable.
This worked to check actual memory usage: grep -E 'MemAvailable|MemFree|SwapTotal|SwapFree|Cached|SReclaimable|Shmem|AnonPages|Mapped|Unevictable|Mlocked' /proc/meminfo
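The key numbers can also be summarized in one line; the point is that mmap'd model pages show up under Cached, which is why MemAvailable looks deceptively high:

```shell
# MemAvailable counts reclaimable page cache, so an mmap'd model inflates it
awk '/^MemAvailable:/ {a=$2} /^Cached:/ {c=$2} END {printf "MemAvailable: %d MiB (page cache: %d MiB)\n", a/1024, c/1024}' /proc/meminfo
```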
./build/bin/llama-cli -m ~/models/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_M-00001-of-00003.gguf -ngl 20 --rpc the_ip_you_got_earlier:50052 --tensor-split 5,8 -p "Say hello in one sentence."
I still need to read the guide at https://unsloth.ai/docs/models/nemotron-3-super to see what I can make faster if anything.
Does anyone have any insight as to whether or not I can squeeze unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 into my setup? Can weights be dequantized and offloaded to my 1080ti on the fly?
And AI assistants constantly say my tensor-split is backwards, but things OOM when I flip it, so... anyone know anything about that?
I'm happy to answer any questions and I'd welcome any critique on my approach or commands above. If there's much interest I'll try to put together a more in-depth guide.