
How do I specify which GPU to use for the KV cache? How do I offload expert tensors to a specific GPU?

Reddit r/LocalLLaMA / 3/16/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The author is asking for a method to specify which GPU to use for the KV cache in llama.cpp / LocalLLaMA.
  • They want to offload expert tensors to a specific GPU and place non-expert tensors on the stronger GPU in a two-GPU setup.
  • The goal is to optimize resource usage by dedicating the strong GPU to critical tensors and the weak GPU to the rest.
  • The post is crossposted from a GitHub discussion and includes a link to that discussion.
  • It seeks practical guidance or code-level solutions to achieve per-tensor GPU offloading in multi-GPU inference.

I crossposted this from here ( https://github.com/ggml-org/llama.cpp/discussions/20642 ) and would love it if anyone had an answer. I was looking into how I could offload expert tensors to a specific GPU, and I am also looking for a way to do the same with the KV cache.

The reason is that I have a weak GPU and a strong GPU, and I want only the non-expert tensors on the strong GPU while putting everything else on the weaker one.
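For the expert-tensor half of the question, llama.cpp's `--override-tensor` (`-ot`) flag maps tensors whose names match a regex onto a chosen backend buffer, which lets you pin MoE expert weights to one device while the rest stays on another. The sketch below assumes a two-GPU CUDA box where `CUDA0` is the strong card and `CUDA1` the weak one; the exact device names, the model path, and the expert-tensor naming pattern (`ffn_*_exps`) are assumptions that depend on your build and model, so check your model's tensor names first:

```shell
# Sketch, not a definitive recipe: steer MoE expert tensors to the weak GPU
# while dense (non-expert) tensors land on the strong GPU.
#
# CUDA0 = strong GPU, CUDA1 = weak GPU (device order follows CUDA enumeration;
# use CUDA_VISIBLE_DEVICES to reorder if needed).
./llama-server \
  -m model.gguf \
  -ngl 99 \
  --main-gpu 0 \
  -ot 'blk\..*\.ffn_.*_exps\.=CUDA1'
# The regex matches per-layer expert tensors such as blk.10.ffn_gate_exps.weight,
# blk.10.ffn_up_exps.weight, and blk.10.ffn_down_exps.weight.
```

As far as I know there is no dedicated flag to pin the KV cache to a single GPU: with the default layer-wise split, each layer's KV cache is allocated on the device that holds that layer, so `--tensor-split` (e.g. `--tensor-split 1,0` to bias layers onto GPU 0) indirectly controls where the cache lands, and `--no-kv-offload` keeps it in host memory entirely.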

submitted by /u/milpster