Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing

Reddit r/LocalLLaMA / 5/2/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The maintainer describes “Box,” a fork of Google’s AI Edge Gallery, built to run a fully offline AI assistant on Android with no cloud inference or accounts.
  • The project experiments with a hybrid on-device stack combining llama.cpp (GGUF LLM), whisper.cpp (offline STT), stable-diffusion.cpp (image generation), and LiteRT for execution.
  • It enables multimodal capabilities including streaming voice-to-voice conversation and live camera frame + natural-language Q&A, while also supporting local document context ingestion and custom GGUF model import.
  • A key architectural takeaway is that hybrid LiteRT + llama.cpp inference performs better than expected on newer Snapdragon/Pixel NPUs, and that model routing (CPU/GPU/NPU/TPU) often matters more than raw model size.
  • The author notes that for many mobile scenarios, memory usage and persistence become the main bottlenecks before compute, and they’re seeking technical feedback on quantization, runtime routing, multimodal pipelines, and performance tuning.
Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing

Hi everyone,

I’m the maintainer of Box — a fork of Google’s AI Edge Gallery that I’ve been extending into a fully offline AI assistant for Android.

Full disclosure: I built this project.

It runs entirely on-device (no cloud, no accounts, no external inference), and combines multiple local inference backends in a single app.


What I’ve been experimenting with

The goal was to see how far a fully offline mobile AI stack could be pushed using:

  • llama.cpp (GGUF LLM inference)
  • whisper.cpp (on-device STT)
  • stable-diffusion.cpp (image generation)
  • LiteRT (Google’s on-device runtime)

All running on Android with hardware acceleration where available (GPU / NPU / TPU).
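
Roughly what accelerator selection looks like on the LiteRT side, as a simplified sketch using the plain TensorFlow Lite interpreter API (which LiteRT stays compatible with). The function name and fallback order here are illustrative, not lifted from the repo:

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.File

// Sketch: build a LiteRT/TFLite interpreter with a preferred accelerator,
// falling back to multithreaded CPU if no delegate is requested.
fun buildInterpreter(modelFile: File, prefer: String): Interpreter {
    val options = Interpreter.Options()
    when (prefer) {
        "npu" -> options.addDelegate(NnApiDelegate()) // NNAPI dispatches to NPU/DSP drivers where present
        "gpu" -> options.addDelegate(GpuDelegate())   // GPU delegate (OpenCL/OpenGL backend)
        else  -> options.setNumThreads(Runtime.getRuntime().availableProcessors())
    }
    return Interpreter(modelFile, options)
}
```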


Current capabilities

  • Voice-to-voice conversation (streaming style, hands-free loop)
  • Vision + voice (live camera frame + natural language Q&A)
  • On-device image generation (Stable Diffusion via GGUF)
  • Document ingestion into context (local files)
  • Custom GGUF model import
  • Runs across CPU / GPU / NPU / TPU (auto-selected)
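
On the "auto-selected" part: it's basically a cheap routing pass before a model is loaded. A rough sketch of that kind of heuristic (the enum, thresholds, and availability flag are made up for illustration; the app's actual logic is more involved):

```kotlin
import android.app.ActivityManager
import android.content.Context

// Illustrative routing heuristic only; names and thresholds are assumptions.
enum class Backend { CPU, GPU, NPU }

fun chooseBackend(context: Context, modelBytes: Long, hasNpuDelegate: Boolean): Backend {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }

    return when {
        // Small quantized models: prefer the NPU path if a delegate is available.
        hasNpuDelegate && modelBytes < 2L * 1024 * 1024 * 1024 -> Backend.NPU
        // Otherwise use the GPU when there is memory headroom beyond the weights.
        memInfo.availMem > modelBytes * 2 -> Backend.GPU
        // Fall back to CPU (llama.cpp handles this path fine with mmap).
        else -> Backend.CPU
    }
}
```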

Architecture focus

What I’ve found interesting while building this:

  • LiteRT + llama.cpp hybrid inference works better than expected on newer Snapdragon/Pixel NPUs
  • Model routing matters more than raw model size on mobile
  • whisper.cpp is still the most stable STT layer for fully offline setups
  • Memory and persistence become the real bottleneck before compute in many cases (see the sketch below)
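
To make the memory point concrete: the back-of-envelope I keep coming back to is weights plus KV cache. A sketch of that estimate (the formula and parameter names are assumed for illustration, not repo code):

```kotlin
// Rough memory estimate before loading a GGUF model (assumed formula, not repo code).
// KV cache ≈ 2 (K and V) * layers * kv_heads * head_dim * context_tokens * bytes/element.
fun estimatedFootprintBytes(
    ggufFileBytes: Long,          // quantized weights; roughly the file size once loaded/mmapped
    nLayers: Int,
    nKvHeads: Int,
    headDim: Int,
    contextTokens: Int,
    kvBytesPerElement: Int = 2,   // f16 KV cache
): Long {
    val kvCacheBytes = 2L * nLayers * nKvHeads * headDim * contextTokens * kvBytesPerElement
    return ggufFileBytes + kvCacheBytes
}
```

For a Q4 7B-class model that works out to roughly the file size plus anywhere from a few hundred MB to ~2 GB of f16 cache at 4K context, depending on whether the model uses GQA, which is why context length is usually the first thing that has to be capped on phones.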

Repo (for reference)

https://github.com/jegly/Box


Why I’m posting this here

I’m mainly sharing this for feedback from people also working on local inference systems, especially around:

  • mobile quantization strategies
  • hybrid runtime routing (CPU/GPU/NPU)
  • multimodal on-device pipelines
  • performance tuning on constrained hardware

Not trying to push adoption — more interested in technical critique than anything else.


Happy to answer questions or go deeper into any part of the stack if useful.

submitted by /u/Healthy_Bedroom5837