Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing

Reddit r/LocalLLaMA / 5/2/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The maintainer describes “Box,” a fork of Google’s AI Edge Gallery, built to run a fully offline AI assistant on Android with no cloud inference or accounts.
  • The project experiments with a hybrid on-device stack combining llama.cpp (GGUF LLM), whisper.cpp (offline STT), stable-diffusion.cpp (image generation), and LiteRT for execution.
  • It enables multimodal capabilities including streaming voice-to-voice conversation and live camera frame + natural-language Q&A, while also supporting local document context ingestion and custom GGUF model import.
  • A key architectural takeaway is that hybrid LiteRT + llama.cpp inference performs better than expected on newer Snapdragon/Pixel NPUs, and that model routing (CPU/GPU/NPU/TPU) often matters more than raw model size.
  • The author notes that for many mobile scenarios, memory usage and persistence become the main bottlenecks before compute, and they’re seeking technical feedback on quantization, runtime routing, multimodal pipelines, and performance tuning.
Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing

Hi everyone,

I’m the maintainer of Box — a fork of Google’s AI Edge Gallery that I’ve been extending into a fully offline AI assistant for Android.

Full disclosure: I built this project.

It runs entirely on-device (no cloud, no accounts, no external inference), and combines multiple local inference backends in a single app.


What I’ve been experimenting with

The goal was to see how far a fully offline mobile AI stack could be pushed using:

  • llama.cpp (GGUF LLM inference)
  • whisper.cpp (on-device STT)
  • stable-diffusion.cpp (image generation)
  • LiteRT (Google’s on-device runtime)

All running on Android with hardware acceleration where available (GPU / NPU / TPU).
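
Roughly what accelerator selection looks like on the LiteRT side, as a simplified sketch using the plain TensorFlow Lite interpreter API (which LiteRT stays compatible with). The function name and fallback order here are illustrative, not lifted from the repo:

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.File

// Sketch: build a LiteRT/TFLite interpreter with a preferred accelerator,
// falling back to multithreaded CPU if no delegate is requested.
fun buildInterpreter(modelFile: File, prefer: String): Interpreter {
    val options = Interpreter.Options()
    when (prefer) {
        "npu" -> options.addDelegate(NnApiDelegate()) // NNAPI dispatches to NPU/DSP drivers where present
        "gpu" -> options.addDelegate(GpuDelegate())   // GPU delegate (OpenCL/OpenGL backend)
        else  -> options.setNumThreads(Runtime.getRuntime().availableProcessors())
    }
    return Interpreter(modelFile, options)
}
```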


Current capabilities

  • Voice-to-voice conversation (streaming style, hands-free loop)
  • Vision + voice (live camera frame + natural language Q&A)
  • On-device image generation (Stable Diffusion via GGUF)
  • Document ingestion into context (local files)
  • Custom GGUF model import
  • Runs across CPU / GPU / NPU / TPU (auto-selected)
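
On the "auto-selected" part: it's basically a cheap routing pass before a model is loaded. A rough sketch of that kind of heuristic (the enum, thresholds, and availability flag are made up for illustration; the app's actual logic is more involved):

```kotlin
import android.app.ActivityManager
import android.content.Context

// Illustrative routing heuristic only; names and thresholds are assumptions.
enum class Backend { CPU, GPU, NPU }

fun chooseBackend(context: Context, modelBytes: Long, hasNpuDelegate: Boolean): Backend {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }

    return when {
        // Small quantized models: prefer the NPU path if a delegate is available.
        hasNpuDelegate && modelBytes < 2L * 1024 * 1024 * 1024 -> Backend.NPU
        // Otherwise use the GPU when there is memory headroom beyond the weights.
        memInfo.availMem > modelBytes * 2 -> Backend.GPU
        // Fall back to CPU (llama.cpp handles this path fine with mmap).
        else -> Backend.CPU
    }
}
```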

Architecture focus

What I’ve found interesting while building this:

  • LiteRT + llama.cpp hybrid inference works better than expected on newer Snapdragon/Pixel NPUs
  • Model routing matters more than raw model size on mobile
  • whisper.cpp is still the most stable STT layer for fully offline setups
  • Memory and persistence become the real bottleneck before compute in many cases (see the sketch below)
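
To make the memory point concrete: the back-of-envelope I keep coming back to is weights plus KV cache. A sketch of that estimate (the formula and parameter names are assumed for illustration, not repo code):

```kotlin
// Rough memory estimate before loading a GGUF model (assumed formula, not repo code).
// KV cache ≈ 2 (K and V) * layers * kv_heads * head_dim * context_tokens * bytes/element.
fun estimatedFootprintBytes(
    ggufFileBytes: Long,          // quantized weights; roughly the file size once loaded/mmapped
    nLayers: Int,
    nKvHeads: Int,
    headDim: Int,
    contextTokens: Int,
    kvBytesPerElement: Int = 2,   // f16 KV cache
): Long {
    val kvCacheBytes = 2L * nLayers * nKvHeads * headDim * contextTokens * kvBytesPerElement
    return ggufFileBytes + kvCacheBytes
}
```

For a Q4 7B-class model that works out to roughly the file size plus anywhere from a few hundred MB to ~2 GB of f16 cache at 4K context, depending on whether the model uses GQA, which is why context length is usually the first thing that has to be capped on phones.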

Repo (for reference)

https://github.com/jegly/Box


Why I’m posting this here

I’m mainly sharing this for feedback from people also working on local inference systems, especially around:

  • mobile quantization strategies
  • hybrid runtime routing (CPU/GPU/NPU)
  • multimodal on-device pipelines
  • performance tuning on constrained hardware

Not trying to push adoption — more interested in technical critique than anything else.


Happy to answer questions or go deeper into any part of the stack if useful.

submitted by /u/Healthy_Bedroom5837