
Llama CPP - any way to load model into VRAM+CPU+SSD with AMD?

Reddit r/LocalLLaMA / 3/19/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post investigates whether Llama CPP can run a giant model (around 170GB, e.g., Qwen3.5 397B Q3_K_S) by distributing data across VRAM, CPU RAM, and an SSD on an AMD system.
  • The user reports loading about 40GB into VRAM on a system with 48GB VRAM and observes the rest being accessed from SSD, with throughput around 0.11 tokens per second.
  • They ask whether this behavior is expected and request known best practices for heavy disk offloading and performance optimization with Llama CPP on AMD hardware.
  • The discussion is framed as a practical hardware/software optimization question rather than a new product release.

Doing the necessary pilgrimage of running a giant model (Qwen3.5 397B Q3_K_S ~170GB) on my system with the following specs:

  • Ryzen 9 3950X

  • 64GB DDR4 (3000 MHz, dual channel)

  • 48GB of VRAM (W6800 and RX 6800)

  • 4TB Crucial P3 Plus (Gen4 drive capped by a PCIe 3.0 motherboard)

Haven't had any luck setting up KTransformers. Is Llama CPP usable for this? I'm chasing something approaching 1 token per second but am stuck at 0.11 tokens/second. It seems my system fills the VRAM (~40GB) and then reads the rest from the SSD; there doesn't appear to be any way to say "load 60GB into RAM at the start".

Is this right? Is there a known best way to do heavy disk offloading with Llama CPP?
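For context on the behavior described above: llama.cpp memory-maps the GGUF file by default, so any weights that don't fit in VRAM plus RAM are demand-paged from the SSD on every token, which matches the ~0.11 tok/s observation. A minimal sketch of the usual knobs follows; the flag names are real llama.cpp options, but the model path and the layer count are illustrative placeholders that would need tuning for this 48GB-VRAM / 64GB-RAM system:

```shell
# Hedged sketch, not a verified config for this exact model:
#   -ngl N   : number of layers to offload to the GPUs (HIP/ROCm build for AMD)
#   --mlock  : pin the memory-mapped weights in RAM so the OS can't evict them
#              (only helps up to available RAM; the remainder still pages from SSD)
#   -t 16    : one thread per physical core on the 3950X
./llama-cli -m ./model.gguf -ngl 30 --mlock -t 16 -p "Hello"
```

Note that the opposite flag, `--no-mmap`, forces the whole model into RAM up front, which cannot work for a ~170GB model on 64GB of RAM; for a model this far over VRAM+RAM capacity, some amount of SSD paging per token is unavoidable with llama.cpp's current design.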

submitted by /u/EmPips