tok/s on ASUS Zenbook A16 (Snapdragon X2)

Reddit r/LocalLLaMA / 4/19/2026


Key Points

  • The post reports benchmark-like observations for running LLMs on the ASUS Zenbook A16 with the Snapdragon X2 Elite Extreme (Qualcomm Oryon Gen 3) using llama.cpp on Windows on ARM.
  • Hardware details include 18 CPU cores, 48GB unified memory with ~228GB/s peak bandwidth, and an Adreno GPU and Hexagon NPU, though neither was successfully leveraged for inference in the reported tests.
  • The author could not get KleidiAI (SME2) to run and also failed to produce usable GPU output with Adreno in llama.cpp; as a result, all tests described were CPU-only.
  • For practical usage, Qwen3.6-35B-A3B (including a Q5_K_M quantized variant) is described as usable even on battery, with the provided tables comparing throughput across quantizations and architectures.
  • The next goal is to run a Whisper model entirely on the NPU to enable low-power dictation for tools like CC and opencode.

just quick numbers for anyone interested in the new snapdragon chipset with windows on arm via llama.cpp

## Hardware

- Snapdragon X2 Elite Extreme (X2E94100, Qualcomm Oryon Gen 3)

- 18 cpu cores

- 48 GB Unified Memory

- ~228 GB/s peak memory bandwidth

- Adreno GPU (unused)

- Decent Hexagon NPU (unused)

- ISA features reported: NEON, FMA, DOTPROD, I8MM, SVE/SVE2, SME/SME2, fp16

- 4096-bit Matrix Engine (SME2) — present in hardware
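The ~228 GB/s figure puts a hard ceiling on token generation, since decoding one token has to stream every active weight from memory. A back-of-envelope sketch (the ~4.5 bits/weight average for Q4_K_M is my approximation, not from the post):

```python
# Back-of-envelope: memory-bandwidth ceiling for token generation.
# TG_max ≈ bandwidth / bytes_of_active_weights_per_token.
# Assumes Q4_K_M averages ~4.5 bits/weight (an approximation).

BANDWIDTH = 228e9          # ~228 GB/s peak, per the post
BITS_PER_WEIGHT = 4.5      # rough Q4_K_M average (assumption)

def tg_ceiling(active_params: float) -> float:
    """Upper bound on tokens/s if generation were purely bandwidth-bound."""
    bytes_per_token = active_params * BITS_PER_WEIGHT / 8
    return BANDWIDTH / bytes_per_token

# MoE with ~3B active params per token (like Qwen3.6-35B-A3B)
print(f"~3B active (MoE): {tg_ceiling(3e9):.0f} t/s ceiling")
# dense 31B model: all 31B params touched every token
print(f"31B dense:        {tg_ceiling(31e9):.0f} t/s ceiling")
```

The measured numbers below land well under these ceilings (~33 vs ~135 t/s for the MoE, ~6.5 vs ~13 t/s for the dense 31B), which is typical for CPU-only inference where compute and cache behavior also bite.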

i couldnt get KleidiAI (SME2) to work (guessing windows problem?)

llama.cpp does recognize and try to use the adreno gpu, but everything ive tried gets the adreno gpu to 100% without ever producing output. so all tests below are CPU only, just using the unified memory
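For anyone wanting to reproduce this style of numbers, a hypothetical llama-bench invocation (model path is a placeholder, not from the post):

```shell
# llama-bench's default workloads are pp512 (prompt processing) and
# tg128 (token generation), which is where PP512/TG128 numbers come from.
# -t 18  : one thread per CPU core
# -ngl 0 : no GPU offload, CPU-only like the tests here
./llama-bench -m qwen3.6-35b-a3b-q4_k_m.gguf -t 18 -ngl 0
```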

been using Q5 qwen3.6 in opencode and its actually pretty usable! not the fastest but its great fun to be able to run it locally, even on battery it chugs along no problem. been impressed with this laptop so far

next project is getting a whisper model running 100% on the NPU (qualcomm has some literature on this, hopefully it works nicely so i can dictate to CC and opencode at low power draw)

### Q4_K_M comparison across architectures

| Model | Architecture | Size | Active | PP512 | TG128 |
|---|---|---:|---|---:|---:|
| Qwen3-4B | dense | 2.32 GiB | 4B | 248 t/s | 42 t/s |
| Gemma-4-31B-it | dense | 18.24 GiB | 31B | 39 t/s | **6.5 t/s** |
| Gemma-4-26B-A4B-it | MoE | 15.63 GiB | ~4B | 168 t/s | 31 t/s |
| Qwen3.6-35B-A3B | MoE | 19.91 GiB | ~3B | 171 t/s | 33 t/s |

### Qwen3.6-35B-A3B quant + runtime config comparison

| Quant | Size | KV config | PP512 (t/s) | TG128 (t/s) |
|---|---:|---|---:|---:|
| Q4_K_M | 19.91 GiB | fp16, no FA | 171 | 33.0 |
| Q5_K_M | 23.29 GiB | fp16, no FA | 153 | 30.4 |
| **Q5_K_M** | **23.29 GiB** | **q8_0 KV + FA (opencode)** | **145** | **29.6** |
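A hypothetical llama-server launch matching the "q8_0 KV + FA" config (model path and port are placeholders; exact flag syntax varies across llama.cpp versions):

```shell
# -ctk/-ctv : quantize the KV cache to q8_0
# -fa       : enable flash attention
# -ngl 0    : CPU-only, as in the tests above
./llama-server -m qwen3.6-35b-a3b-q5_k_m.gguf \
  -t 18 -ngl 0 \
  -ctk q8_0 -ctv q8_0 -fa \
  --port 8080
```

opencode can then be pointed at the server's OpenAI-compatible endpoint on that port.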
submitted by /u/Hotschmoe