To 16GB VRAM users, plug in your old GPU

Reddit r/LocalLLaMA / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • For users with 16GB of VRAM trying to run dense ~30B LLMs, the post suggests plugging in an older GPU with at least 6GB of VRAM to effectively increase available VRAM.
  • It explains that performance can still improve even if the second card is much weaker, as long as the model fits in VRAM across the two devices.
  • The post recommends using llama-server with a configuration that enables multiple GPUs on the Vulkan backend, keeps model data out of system RAM (no-mmap, with mlock left off), and tunes KV cache settings to reduce VRAM requirements.
  • A practical setup example is provided (a 16GB RTX 5070 Ti + 6GB RTX 2060) along with guidance on checking device IDs via `llama-server.exe --list-devices`.
  • It reports benchmark-like results at large context sizes (~71k tokens of actual context) showing much faster throughput than a single 16GB card alone.

For those who want to run the latest dense ~30B models and only have 16GB of VRAM: if you have an old card with 6GB of VRAM or more, plug it in.

What matters is that everything fits in VRAM, even split across two cards, and even if one of them is quite weak.

I have a 5070 Ti 16GB and an old 2060 6GB. The common wisdom is that you need two identical GPUs to maximize performance. But one day I was struck by the idea: why not give it a try?

Let's see: if you didn't buy a motherboard just for LLMs, it's very likely you have one true PCIe x16 slot plus a couple that look like x16 but are actually wired as x4, just like me. That's a perfect slot for an old card.

16GB + 6GB = 22GB, which gets you close to a 24GB-class card. If you have a better old card, lucky you!
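As a rough sanity check on that budget, here's a back-of-envelope estimate of what a ~27B model at Q4_K_M plus a q8_0 KV cache at 128k context costs in VRAM. The architecture numbers (layer count, KV heads, head dimension) are illustrative assumptions, not the model's actual specs; read the real values from your GGUF's metadata.

```python
# Back-of-envelope VRAM estimate. All architecture numbers below are
# assumptions for illustration; substitute your model's real values.

params_b = 27e9        # ~27B parameters
bpw_q4_k_m = 4.8       # Q4_K_M averages roughly 4.8 bits per weight
weights_gb = params_b * bpw_q4_k_m / 8 / 1e9   # ≈ 16.2 GB

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem
n_layers, n_kv_heads, head_dim = 48, 4, 128    # assumed GQA config
ctx = 128_000
bytes_per_elem = 1     # q8_0 cache is ~1 byte/element plus small block overhead
kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9  # ≈ 6.3 GB

print(f"weights ≈ {weights_gb:.1f} GB, KV ≈ {kv_gb:.1f} GB, "
      f"total ≈ {weights_gb + kv_gb:.1f} GB")
```

With these assumed numbers the total lands right around the 22GB budget, and an f16 cache would roughly double the KV term, which is why the q8_0 cache types in the config below matter.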

Then you run llama-server with a config like this:

```ini
[*]
jinja = true
cache-prompt = true
n-gpu-layers = 999
no-mmap = true
mlock = false
np = 1
t = 0

[qwen/qwen3.6-27b]
model = ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf
mmproj = ./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf
reasoning = on
dev = Vulkan1,Vulkan2
c = 128000
no-mmproj-offload = true
cache-type-k = q8_0
cache-type-v = q8_0
```

A couple of specific points (a command-line equivalent follows this list):
- dev=Vulkan1,Vulkan2 enables the two GPUs; run `llama-server.exe --list-devices` to see what you should set.
- no-mmap and mlock=false keep the model out of your system RAM.
- np=1, no-mmproj-offload (or just don't supply an mmproj model), and the cache-type-k/cache-type-v settings minimize the VRAM needed.
- n-gpu-layers=999 prefers GPU offloading for every layer; this may be unnecessary, but I'd keep it.
- split-mode=layer splits the layers asymmetrically across the devices; "layer" is the default, though, so you don't see it above.
- c=128000 could be a bit of a stretch, but it works well enough for me.
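If you'd rather launch the server directly instead of going through a config file, the same settings map onto command-line flags. Below is a sketch in Python that assembles and runs the equivalent command; the flag names follow current llama.cpp conventions, but verify them against your build with `llama-server --help` (the `reasoning` option is omitted here since its CLI spelling varies by build).

```python
# Sketch: launch llama-server with flags mirroring the config above.
# Verify flag names against your llama.cpp build (`llama-server --help`).
import subprocess

cmd = [
    "llama-server",                   # "llama-server.exe" on Windows
    "-m", "./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf",
    "--mmproj", "./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf",
    "--jinja",                        # jinja = true
    "--n-gpu-layers", "999",          # n-gpu-layers = 999
    "--no-mmap",                      # no-mmap = true; mlock is off by default
    "--parallel", "1",                # np = 1
    "--threads", "0",                 # t = 0, as in the config
    "--device", "Vulkan1,Vulkan2",    # dev = Vulkan1,Vulkan2
    "--ctx-size", "128000",           # c = 128000
    "--no-mmproj-offload",
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
    "--split-mode", "layer",          # the default, shown explicitly
]
subprocess.run(cmd, check=True)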

BTW, I also have an Intel integrated GPU that my monitors are plugged into; that one is Vulkan0.

Some numbers: at 128k max context with ~71k tokens of actual context usage, pp = 186 t/s and tg = 19 t/s, which is quite usable compared to the 4 t/s I got on a single card.

```
[56288] prompt eval time =  5761.53 ms / 1076 tokens (  5.35 ms per token, 186.76 tokens per second)
[56288]        eval time = 58000.15 ms / 1114 tokens ( 52.06 ms per token,  19.21 tokens per second)
[56288]       total time = 63761.69 ms / 2190 tokens
[56288] slot release: id 0 | task 654 | stop processing: n_tokens = 71703, truncated = 0
```
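Those throughput figures drop straight out of the timings in the log; here's the arithmetic, just to make the numbers concrete.

```python
# Recompute pp/tg throughput from the server log timings above.
prompt_ms, prompt_tokens = 5761.53, 1076
eval_ms, eval_tokens = 58000.15, 1114

pp = prompt_tokens / (prompt_ms / 1000)  # prompt processing, tokens/s
tg = eval_tokens / (eval_ms / 1000)      # token generation, tokens/s
print(f"pp = {pp:.2f} t/s, tg = {tg:.2f} t/s")  # pp = 186.76 t/s, tg = 19.21 t/s
```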
submitted by /u/akira3weet