I tried to benchmark TurboQuant on Android (Snapdragon 7s Gen 3) — here's what actually happened

Reddit r/LocalLLaMA / 3/31/2026


Key Points

  • A Reddit user attempted to benchmark TurboQuant on an Android phone (Snapdragon 7s Gen 3, CPU-only) using a Termux-native workflow and GitHub Actions cross-compilation, but hit multiple build-system/flag issues.
  • The cross-compile succeeded and the binary ran, yet the resulting build did not include the expected TurboQuant (tq3_0) type registration, indicating the branch wasn’t fully integrated with llama.cpp’s TurboQuant type system.
  • The post concludes TurboQuant for ARM CPU is not ready on mobile because the available community validations are for CUDA/Apple Metal rather than Android ARM, and the closest ARM reference code isn’t merged.
  • It notes open upstream PRs (#21088/#21089) and that once merged, TurboQuant should provide a reported ~4.4× KV compression memory win that could materially expand mobile context lengths by reducing OOM risk.
  • The author points to a public CI workflow that cross-compiles and verifies tq3_0 presence in the produced binary automatically, with plans to publish benchmark results after the upstream merge.

Building a sovereign Android dev stack from a single phone. No PC. Termux-native. When TurboQuant dropped last week I immediately wanted to know: does this work on ARM CPU-only? Nobody had tested it on mobile hardware.

My setup:

Xiaomi Redmi Note 14 Pro+ 5G

Snapdragon 7s Gen 3 (ARMv8-A, 8GB RAM)

Termux native, Android 16

No GPU offload (Adreno 730 rejects Qwen3.5 Hybrid Linear Attention kernels)

What I did:

Built the Aaryan-Kapoor turboquant-tq3_0 branch via GitHub Actions cross-compile (can't build on-device — 8GB RAM, -j2 max). Flags: -march=armv8-a+dotprod+i8mm, CPU-only, no NDK.

5 failed builds. Each one taught me something:

llama-server is not a valid target in this branch

CMAKE_SYSTEM_NAME=Android pulls in NDK clang → POSIX_MADV_WILLNEED undefined

Without CMAKE_SYSTEM_NAME=Linux + CMAKE_SYSTEM_PROCESSOR=aarch64, cmake injects -mavx2 -msse4.2 into an ARM build
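Putting those lessons together, the configure step ends up looking roughly like this (a sketch, not verbatim from my workflow: the cross-compiler names and the GGML_NATIVE/llama-cli bits assume a stock gcc-aarch64-linux-gnu toolchain and current llama.cpp CMake options):

```shell
# Target Linux/aarch64 from an x86_64 runner.
# CMAKE_SYSTEM_NAME=Linux (not Android) avoids NDK clang and the
# POSIX_MADV_WILLNEED failure; CMAKE_SYSTEM_PROCESSOR=aarch64 stops
# cmake from injecting -mavx2 -msse4.2 into the ARM build.
cmake -B build \
  -DCMAKE_SYSTEM_NAME=Linux \
  -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
  -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
  -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
  -DCMAKE_C_FLAGS="-march=armv8-a+dotprod+i8mm" \
  -DCMAKE_CXX_FLAGS="-march=armv8-a+dotprod+i8mm" \
  -DGGML_NATIVE=OFF

# llama-server is not a target in this branch, so build llama-cli instead.
cmake --build build --target llama-cli -j
```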

The result:

Source: turboquant-tq3_0

TQ3_0: false

Target: aarch64 ARMv8-A+dotprod+i8mm

Build succeeded. Binary runs. But strings finds no tq3_0 type name in the binary. The branch compiles cleanly, but the GGML type registration for TurboQuant isn't merged into it as of 2026-03-30.
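If you want to reproduce the check, this is all it is (the binary path is an assumption for a default cmake build tree; adjust as needed):

```shell
# A build with tq3_0 support carries the type name as a string literal,
# so it shows up in the binary's string table; absence means the type
# was never registered at compile time.
BIN=build/bin/llama-cli   # assumed path for a default cmake build tree
if strings "$BIN" 2>/dev/null | grep -q "tq3_0"; then
  echo "TQ3_0: true"
else
  echo "TQ3_0: false"
fi
```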

What this means:

TurboQuant on ARM CPU is not ready. The community implementations (turboquant_plus, TheTom's fork) are validated on Apple Silicon Metal and CUDA. The Aaryan-Kapoor CPU reference implementation is the closest thing to ARM-compatible code, but it's not integrated into llama.cpp's type system yet.

The upstream PRs (#21088/#21089) are open. When they land, the memory win (~4.4x KV compression) would matter enormously for 8GB mobile devices: the difference between 4K and 32K context without OOM.
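To make that concrete, here's a back-of-envelope KV-cache calc. Every model number here is my assumption (a Llama-style 8B: 32 layers, 8 KV heads, head_dim 128, fp16 cache); only the ~4.4x factor comes from the TurboQuant claim:

```shell
# KV bytes per token = 2 (K+V) * n_layers * n_kv_heads * head_dim * bytes/elem
# Assumed model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
awk 'BEGIN {
  per_tok = 2 * 32 * 8 * 128 * 2            # ~128 KiB per token
  for (ctx = 4096; ctx <= 32768; ctx *= 8) {
    fp16 = per_tok * ctx / 2^30             # raw fp16 cache, GiB
    tq   = fp16 / 4.4                       # reported ~4.4x compression
    printf "ctx %5d: fp16 %.2f GiB -> tq3_0 %.2f GiB\n", ctx, fp16, tq
  }
}'
```

On those assumptions, raw fp16 KV at 32K context is about 4 GiB, which an 8GB phone running Android plus the model weights simply doesn't have; at 4.4x it drops under 1 GiB.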

The CI workflow is public: github.com/weissmann93/neobildOS — .github/workflows/build-llama-tq3.yml. It cross-compiles llama.cpp for ARM64 from any machine and checks for TQ3_0 presence in the binary. When the upstream PRs merge, re-run it and the check goes green automatically.

Will post benchmark numbers (q8_0 baseline vs TQ3_0 when it lands) as a follow-up.

submitted by /u/NeoLogic_Dev