Running llama.cpp on Snapdragon Hexagon NPU seems promising

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The Reddit poster cross-compiled llama.cpp for the Hexagon NPU on a Snapdragon 8 Gen 3 (OnePlus 12) and reports that inference runs without the phone heating up, at speeds roughly on par with CPU execution and within practical range.
  • Token speeds (pp/tg) are reported for gemma-3 Q4_0-family GGUFs (e.g. 12B/4B, it-qat-Q4_0); the smaller model in particular feels responsive enough for Q&A use.
  • The current Hexagon backend supports only a limited set of quantizations (Q4_0, IQ4_NL, MXFP4, Q8_0 and F32 GGUFs) and does not yet support KV cache quantization.
  • Because the NPU can only address 4GB of memory at a time, larger model plus KV cache combinations require specifying multiple NPU devices (e.g. HTP0/HTP1).
  • Qualcomm is backing the Hexagon backend with many PRs, but combined GPU (Adreno)/CPU use and offload control, and whether newer chips (SD 8 Elite Gen 5 / X2 Elite Extreme) lift the 4GB limit, still call for additional verification from users.

https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/snapdragon/README.md

I have a OnePlus 12 with a Snapdragon 8 Gen 3. I followed the above README to cross-compile llama.cpp on Ubuntu and then copied the binaries to the Termux directory on the phone.
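For anyone curious, the flow is roughly the following (a sketch from memory; the exact CMake flag names, NDK level and paths are illustrative, so treat the README as authoritative):

# on the Ubuntu host, with the Android NDK and Hexagon SDK installed
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-31 \
  -DGGML_HEXAGON=ON -DHEXAGON_SDK_ROOT=$HEXAGON_SDK_ROOT
cmake --build build-android --config Release
# then push the bin/ and lib/ outputs (including the Hexagon skel libs) to the phone and run from Termux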

It seems like llama.cpp's Hexagon backend is actively supported by Qualcomm, with many PRs coming from Qualcomm employees.

I am getting 8 t/s pp and 4.5 t/s tg with gemma-3-12b-it-qat-Q4_0, and 20 t/s pp / 12.5 t/s tg with gemma-3-4b-it-qat-Q4_0.

Speed is about the same as using the SD8G3's CPU, but the phone doesn't get hot at all, and the tg speed is good enough for simple Q&A.
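If you want comparable numbers on your own device, llama-bench gives clean pp/tg figures; this is roughly the invocation I'd use (the env vars mirror the completion command further down, and the -p/-n sizes are arbitrary):

LD_LIBRARY_PATH=./lib:/vendor/lib64 ADSP_LIBRARY_PATH=./lib ./bin/llama-bench -m /sdcard/gguf/gemma-3-4b-it-qat-Q4_0.gguf -ngl 99 -p 512 -n 128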

The main limitations right now are that it only supports Q4_0, IQ4_NL, MXFP4, Q8_0 and F32 GGUFs, and it doesn't support KV cache quantization. Also, while it supports chips as old as the Snapdragon 888, only SD8G2 or newer SoCs have the Tensor module for LLMs, so you probably shouldn't bother with it if your chip is too old.
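If a model you want only ships in an unsupported quant, requantizing it to Q4_0 with llama-quantize should work (going from an already-quantized GGUF needs --allow-requantize and costs some quality; the filenames here are just placeholders):

./bin/llama-quantize --allow-requantize model-Q4_K_M.gguf model-Q4_0.gguf Q4_0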

Since the Hexagon NPU can only address 4GB of RAM, if your model plus KV cache is too big, you need to set an environment variable to open more than one NPU device. Here is an example:

LD_LIBRARY_PATH=./lib:/vendor/lib64 ADSP_LIBRARY_PATH=./lib GGML_HEXAGON_NDEV=2 ./bin/llama-completion -m /sdcard/gguf/gemma-3-12b-it-qat-Q4_0.gguf -sys 'You are a helpful AI assistant' -ngl 99 --device HTP0,HTP1
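As far as I can tell, GGML_HEXAGON_NDEV=2 opens two HTP sessions and --device HTP0,HTP1 splits the layers across them, so each session stays under its own 4GB window. For a model that fits in a single session, the same command shrinks to one device:

LD_LIBRARY_PATH=./lib:/vendor/lib64 ADSP_LIBRARY_PATH=./lib ./bin/llama-completion -m /sdcard/gguf/gemma-3-4b-it-qat-Q4_0.gguf -sys 'You are a helpful AI assistant' -ngl 99 --device HTP0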

My SD8G3's NPU has 34 INT8 TOPS and 76.8GB/s of memory bandwidth. Their latest product, the X2 Elite Extreme, has 80 INT8 TOPS and 228GB/s. On the other hand, an Nvidia 3090 has 248 INT8 TOPS and 936GB/s. So probably two or three generations to catch up?
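For rough context, tg is mostly memory-bandwidth-bound, so an upper bound is bandwidth divided by the bytes read per token, which is roughly the model file size (I'm assuming ~7GB for the 12B Q4_0):

76.8 GB/s ÷ ~7 GB ≈ 11 t/s ceiling, vs 4.5 t/s measured (~40% efficiency)
228 GB/s ÷ ~7 GB ≈ 33 t/s ceiling on the X2 Elite Extreme
936 GB/s ÷ ~7 GB ≈ 134 t/s ceiling on a 3090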

PS Does anyone own a SD 8 Elite Gen 5 smartphone or an X2 Elite Extreme laptop? If so, can you report your inference performance numbers? Supposedly they can address more than 4GB of RAM, so multiple HTP devices wouldn't be needed; is that supported by llama.cpp now?

PPS The Hexagon build is supposedly also an OpenCL build. Does anyone know how to offload LLMs to the Adreno GPU only? If I omit the --device option, it seems to offload to both the GPU and the NPU without being any faster. Also, is it possible to use the CPU, GPU and NPU together for maximum performance (albeit an ice pack might be needed)?
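If someone wants to dig into this: recent llama.cpp builds have a --list-devices flag (I'm assuming it made it into the Hexagon/OpenCL build too), which should print every backend device the build registered; then you could pass only the GPU entry:

./bin/llama-completion --list-devices
./bin/llama-completion -m /sdcard/gguf/gemma-3-4b-it-qat-Q4_0.gguf -ngl 99 --device <Adreno/OpenCL device name from the list>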

submitted by /u/Ok_Warning2146