To 16GB VRAM users, plug in your old GPU

Reddit r/LocalLLaMA / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • For users with 16GB of VRAM trying to run dense ~30B LLMs, the post suggests plugging in an older GPU with at least 6GB of VRAM to effectively increase available VRAM.
  • It explains that performance can still improve even if the second card is much weaker, as long as the model fits in VRAM across the two devices.
  • The post recommends using llama-server with a configuration that enables multiple GPUs on the Vulkan backend, keeps model data out of system RAM (no-mmap, with mlock left off), and tunes KV cache settings to reduce VRAM requirements.
  • A practical setup example is provided (a 16GB RTX 5070 Ti + 6GB RTX 2060) along with guidance on checking device IDs via `llama-server.exe --list-devices`.
  • It reports benchmark-like results at large context sizes (~71k tokens of actual context) showing much faster throughput than a single 16GB card alone.

For those who want to run the latest dense ~30B models and only have 16GB of VRAM: if you have an old card with 6GB of VRAM or more, plug it in.

What matters is that everything fits in VRAM, even split across two cards, and even if one of them is quite weak.

I have a 5070 Ti 16GB and an old 2060 6GB. The common wisdom is that you need two identical GPUs to maximize performance. But one day I was struck by the idea: why not give it a try?

Let's see: if you didn't buy a motherboard just for LLMs, it's very likely you have one true PCIe x16 slot plus a couple that look like x16 but are actually wired as x4, just like me. That's a perfect slot for an old card.

16GB + 6GB = 22GB, which gets you close to a 24GB-class card. If you have a better old card, lucky you!
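As a rough sanity check on that budget, here's a back-of-envelope estimate of what a ~27B model at Q4_K_M plus a q8_0 KV cache at 128k context costs in VRAM. The architecture numbers (layer count, KV heads, head dimension) are illustrative assumptions, not the model's actual specs; read the real values from your GGUF's metadata.

```python
# Back-of-envelope VRAM estimate. All architecture numbers below are
# assumptions for illustration; substitute your model's real values.

params_b = 27e9        # ~27B parameters
bpw_q4_k_m = 4.8       # Q4_K_M averages roughly 4.8 bits per weight
weights_gb = params_b * bpw_q4_k_m / 8 / 1e9   # ≈ 16.2 GB

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem
n_layers, n_kv_heads, head_dim = 48, 4, 128    # assumed GQA config
ctx = 128_000
bytes_per_elem = 1     # q8_0 cache is ~1 byte/element plus small block overhead
kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9  # ≈ 6.3 GB

print(f"weights ≈ {weights_gb:.1f} GB, KV ≈ {kv_gb:.1f} GB, "
      f"total ≈ {weights_gb + kv_gb:.1f} GB")
```

With these assumed numbers the total lands right around the 22GB budget, and an f16 cache would roughly double the KV term, which is why the q8_0 cache types in the config below matter.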

Then you run llama-server with a config like this:

```ini
[*]
jinja = true
cache-prompt = true
n-gpu-layers = 999
no-mmap = true
mlock = false
np = 1
t = 0

[qwen/qwen3.6-27b]
model = ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf
mmproj = ./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf
reasoning = on
dev = Vulkan1,Vulkan2
c = 128000
no-mmproj-offload = true
cache-type-k = q8_0
cache-type-v = q8_0
```

A couple of specific points (a command-line equivalent follows this list):
- dev=Vulkan1,Vulkan2 enables the two GPUs; run `llama-server.exe --list-devices` to see what you should set.
- no-mmap and mlock=false keep the model out of your system RAM.
- np=1, no-mmproj-offload (or just don't supply an mmproj model), and the cache-type-k/cache-type-v settings minimize the VRAM needed.
- n-gpu-layers=999 prefers GPU offloading for every layer; this may be unnecessary, but I'd keep it.
- split-mode=layer splits the layers asymmetrically across the devices; "layer" is the default, though, so you don't see it above.
- c=128000 could be a bit of a stretch, but it works well enough for me.
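If you'd rather launch the server directly instead of going through a config file, the same settings map onto command-line flags. Below is a sketch in Python that assembles and runs the equivalent command; the flag names follow current llama.cpp conventions, but verify them against your build with `llama-server --help` (the `reasoning` option is omitted here since its CLI spelling varies by build).

```python
# Sketch: launch llama-server with flags mirroring the config above.
# Verify flag names against your llama.cpp build (`llama-server --help`).
import subprocess

cmd = [
    "llama-server",                   # "llama-server.exe" on Windows
    "-m", "./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf",
    "--mmproj", "./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf",
    "--jinja",                        # jinja = true
    "--n-gpu-layers", "999",          # n-gpu-layers = 999
    "--no-mmap",                      # no-mmap = true; mlock is off by default
    "--parallel", "1",                # np = 1
    "--threads", "0",                 # t = 0, as in the config
    "--device", "Vulkan1,Vulkan2",    # dev = Vulkan1,Vulkan2
    "--ctx-size", "128000",           # c = 128000
    "--no-mmproj-offload",
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
    "--split-mode", "layer",          # the default, shown explicitly
]
subprocess.run(cmd, check=True)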

BTW, I also have an Intel integrated GPU that my monitors are plugged into; that one is Vulkan0.

Some numbers: at 128k max context with ~71k tokens of actual context usage, pp = 186 t/s and tg = 19 t/s, which is quite usable compared to the 4 t/s I got on a single card.

```
[56288] prompt eval time =  5761.53 ms / 1076 tokens (  5.35 ms per token, 186.76 tokens per second)
[56288]        eval time = 58000.15 ms / 1114 tokens ( 52.06 ms per token,  19.21 tokens per second)
[56288]       total time = 63761.69 ms / 2190 tokens
[56288] slot release: id 0 | task 654 | stop processing: n_tokens = 71703, truncated = 0
```
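Those throughput figures drop straight out of the timings in the log; here's the arithmetic, just to make the numbers concrete.

```python
# Recompute pp/tg throughput from the server log timings above.
prompt_ms, prompt_tokens = 5761.53, 1076
eval_ms, eval_tokens = 58000.15, 1114

pp = prompt_tokens / (prompt_ms / 1000)  # prompt processing, tokens/s
tg = eval_tokens / (eval_ms / 1000)      # token generation, tokens/s
print(f"pp = {pp:.2f} t/s, tg = {tg:.2f} t/s")  # pp = 186.76 t/s, tg = 19.21 t/s
```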
submitted by /u/akira3weet