For those who want to run the latest dense ~30B models but only have 16GB of VRAM: if you have an old card with 6GB of VRAM or more, plug it in.
What matters is that everything fits in VRAM, even split across two cards, and even if one of them is quite weak.
I have a 5070 Ti 16GB and an old 2060 6GB. The common wisdom is that you need two identical GPUs to maximize performance. But one day I was struck by the idea: why not give it a try?
Let's see: if you didn't buy a motherboard just for LLMs, it's very likely you have one true PCIe x16 slot plus a couple that look like x16 but are actually wired as x4, just like me. That's a perfect slot for an old card.
16GB + 6GB = 22GB, which gets close to a 24GB-class card. If you have a better old card, lucky you!
Then you run llama-server with a config like this:
```ini
[*]
jinja = true
cache-prompt = true
n-gpu-layers = 999
no-mmap = true
mlock = false
np = 1
t = 0

[qwen/qwen3.6-27b]
model = ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf
mmproj = ./Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf
reasoning = on
dev = Vulkan1,Vulkan2
c = 128000
no-mmproj-offload = true
cache-type-k = q8_0
cache-type-v = q8_0
```

A couple of specific points:
- `dev = Vulkan1,Vulkan2` enables the two GPUs; run `llama-server.exe --list-devices` to see what you should set.
- `no-mmap = true` and `mlock = false` keep the model out of your system RAM.
- `np = 1`, `no-mmproj-offload` (or just don't supply an mmproj model), `cache-type-k` and `cache-type-v` minimize the VRAM needed.
- `n-gpu-layers = 999` to prefer GPU offloading. This may be unnecessary, but I'd keep it.
- `split-mode = layer` splits the layers asymmetrically across the devices; "layer" is the default, though, so you don't see it above.
- `c = 128000` might be a bit of a stretch, but it works well enough for me.
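If you'd rather skip the config file, the same settings map onto llama-server's command-line flags roughly like this. This is a sketch from memory of llama.cpp's CLI; flag names can shift between builds, so check `llama-server --help` on yours:

```shell
# Sketch: the config above expressed as CLI flags (verify against your build's --help).
# --device picks the two discrete GPUs (skipping the iGPU at Vulkan0),
# -ngl 999 asks for all layers on GPU, and --split-mode layer (the default)
# places whole layers on each card according to its available VRAM.
llama-server \
  --jinja \
  -m ./Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf \
  --device Vulkan1,Vulkan2 \
  -ngl 999 --split-mode layer \
  --no-mmap --no-mmproj-offload \
  -np 1 -t 0 -c 128000 \
  --cache-type-k q8_0 --cache-type-v q8_0
```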
BTW, I also have an Intel integrated GPU that the monitors are plugged into, which is Vulkan0.
Some numbers: at 128k max context with 71k of context actually in use, pp = 186 t/s and tg = 19 t/s, quite a usable speed compared to the 4 t/s on a single card.
```
[56288] prompt eval time =  5761.53 ms / 1076 tokens (  5.35 ms per token, 186.76 tokens per second)
[56288]        eval time = 58000.15 ms / 1114 tokens ( 52.06 ms per token,  19.21 tokens per second)
[56288]       total time = 63761.69 ms / 2190 tokens
[56288] slot release: id 0 | task 654 | stop processing: n_tokens = 71703, truncated = 0
```
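As a sanity check, the throughput figures in that log are just tokens divided by elapsed seconds, and a one-liner reproduces the server's numbers:

```shell
# tokens / (ms / 1000) = tokens per second, matching the log lines above
awk 'BEGIN {
  printf "pp: %.2f t/s\n", 1076 / (5761.53 / 1000)    # prompt processing
  printf "tg: %.2f t/s\n", 1114 / (58000.15 / 1000)   # token generation
}'
# → pp: 186.76 t/s
# → tg: 19.21 t/s
```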




