Doing the necessary pilgrimage of running a giant model (Qwen3.5 397B Q3_K_S, ~170GB) on my system with the following specs:
3950X
64GB DDR4 (3000 MHz, dual channel)
48GB of VRAM (W6800 and RX 6800)
4TB Crucial P3 Plus (Gen4 drive capped by a PCIe 3.0 motherboard)
Haven't had any luck setting up KTransformers. Is llama.cpp usable for this? I'm chasing something approaching 1 token/second but am stuck at 0.11 tokens/second. It seems my system loads ~40GB into VRAM and then pages the rest from the SSD; I can't find a way to say "load 60GB into RAM at the start".
Is this right? Is there a known best way to do heavy disk offloading with llama.cpp?
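For context, here's the kind of invocation I've been poking at. This is a sketch, not verified on this model: the flag names are from recent llama.cpp builds, and the expert-tensor regex assumes a Qwen3-style MoE tensor naming scheme (check your GGUF's actual tensor names before relying on it).

```shell
# Default behavior: llama.cpp mmaps the GGUF, so layers not offloaded to GPU
# are paged in from the SSD on demand -- which looks like my 0.11 t/s case.
./llama-cli -m qwen3.5-397b-q3_k_s.gguf -ngl 99 -c 8192 -p "hello"

# If the model is MoE, --override-tensor (-ot) can keep the dense/attention
# weights on GPU while routing only the big expert tensors to CPU/mmap,
# so the OS page cache holds the "hot" experts in RAM:
./llama-cli -m qwen3.5-397b-q3_k_s.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU" \
  -c 8192 -p "hello"
# NOTE: the "ffn_.*_exps" pattern is an assumption based on Qwen3 MoE layouts.

# --no-mmap forces a full load into RAM at startup, and --mlock pins the
# mapping -- but with a 170GB model and 64GB of RAM both should fail or
# swap, so neither seems like the answer to "load 60GB into RAM" here.
```

As I understand it, there's no flag for "load exactly 60GB into RAM"; the page cache decides what stays resident, so the practical lever is shrinking what gets paged (smaller quant, expert offload) rather than pinning a fixed slice.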