
How do I specify which GPU to use for the KV cache? How do I offload expert tensors to a specific GPU?

Reddit r/LocalLLaMA / 3/16/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The author is asking for a method to specify which GPU to use for the KV cache in llama.cpp / LocalLLaMA.
  • They want to offload expert tensors to a specific GPU and place non-expert tensors on the stronger GPU in a two-GPU setup.
  • The goal is to optimize resource usage by dedicating the strong GPU to critical tensors and the weak GPU to the rest.
  • The post is crossposted from a GitHub discussion and includes a link to that discussion.
  • It seeks practical guidance or code-level solutions to achieve per-tensor GPU offloading in multi-GPU inference.

I crossposted this from here ( https://github.com/ggml-org/llama.cpp/discussions/20642 ) and would love it if anyone had an answer. I was looking into how I could offload expert tensors to a specific GPU, and I am also looking for a way to do the same with the KV cache.

The reason is that I have a weak GPU and a strong GPU, and I want only the non-expert tensors on the strong GPU while putting everything else on the weaker one.
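For the expert-tensor half of the question, llama.cpp's `--override-tensor` (`-ot`) flag maps tensors whose names match a regex onto a chosen backend buffer, which lets you pin MoE expert weights to one device while the rest stays on another. The sketch below assumes a two-GPU CUDA box where `CUDA0` is the strong card and `CUDA1` the weak one; the exact device names, the model path, and the expert-tensor naming pattern (`ffn_*_exps`) are assumptions that depend on your build and model, so check your model's tensor names first:

```shell
# Sketch, not a definitive recipe: steer MoE expert tensors to the weak GPU
# while dense (non-expert) tensors land on the strong GPU.
#
# CUDA0 = strong GPU, CUDA1 = weak GPU (device order follows CUDA enumeration;
# use CUDA_VISIBLE_DEVICES to reorder if needed).
./llama-server \
  -m model.gguf \
  -ngl 99 \
  --main-gpu 0 \
  -ot 'blk\..*\.ffn_.*_exps\.=CUDA1'
# The regex matches per-layer expert tensors such as blk.10.ffn_gate_exps.weight,
# blk.10.ffn_up_exps.weight, and blk.10.ffn_down_exps.weight.
```

As far as I know there is no dedicated flag to pin the KV cache to a single GPU: with the default layer-wise split, each layer's KV cache is allocated on the device that holds that layer, so `--tensor-split` (e.g. `--tensor-split 1,0` to bias layers onto GPU 0) indirectly controls where the cache lands, and `--no-kv-offload` keeps it in host memory entirely.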

submitted by /u/milpster