Been running Gemma 3n E2B locally on my OnePlus CE 5 (8GB RAM) for a few months. Chat quality is fine for the size. What surprised me was JSON output: give it a short input and a structured prompt, and you get clean, parseable JSON back. Way better than I expected from a 2.4GB model on a phone.
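One caveat on "clean, parseable JSON": small models occasionally wrap the JSON in markdown fences or add a bit of chatter around it, so you still want a defensive extraction step. Rough sketch of the kind of extractor I mean (simplified Python, not the app's actual code):

```python
import json
import re

def extract_json(text):
    """Pull the first JSON object/array out of model output,
    tolerating markdown fences and surrounding chatter."""
    # Strip ```json ... ``` fences if present.
    text = re.sub(r"```(?:json)?", "", text)
    # Find where the first JSON object or array starts.
    start = min((i for i in (text.find("{"), text.find("[")) if i != -1), default=-1)
    if start == -1:
        return None
    # Shrink from the end until the substring parses.
    for end in range(len(text), start, -1):
        try:
            return json.loads(text[start:end])
        except json.JSONDecodeError:
            continue
    return None

raw = 'Sure! ```json\n{"items": [{"text": "buy milk", "tag": "buy"}]}\n``` hope that helps'
print(extract_json(raw))
```

Quadratic in the worst case, but model outputs here are a few hundred characters, so it doesn't matter.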
Got me thinking about voice notes. You ramble for a few seconds ("call the dentist tomorrow at 3, also buy milk on the way home") and Gemma can split that into separate items, tag each one (reminder, buy), and resolve the time. Tried it for a few weeks; categorization is actually decent on real notes, not just the toy ones I started with.
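For the curious, the splitting prompt is roughly this shape (illustrative only; the tag set and wording here are simplified and the real prompt carries few-shot examples):

```python
def build_prompt(transcript: str) -> str:
    # Illustrative prompt shape, not the app's exact wording.
    return f"""You split a voice note transcript into separate action items.
Return ONLY a JSON array. Each item is an object:
  {{"text": "...", "tag": "reminder" | "buy" | "note", "when": "ISO-8601 time or null"}}

Transcript: "{transcript}"
JSON:"""

prompt = build_prompt("call the dentist tomorrow at 3, also buy milk on the way home")
# The model comes back with something along the lines of:
# [{"text": "call the dentist", "tag": "reminder", "when": "...T15:00"},
#  {"text": "buy milk", "tag": "buy", "when": null}]
```

Keeping the tag vocabulary small and spelled out in the prompt is most of why a 2B-class model stays on-format.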
Built an Android app around it: Whisper Small (244MB) for transcription via Sherpa-ONNX, Gemma 3n E2B (2.4GB) for the splitting and categorization via LiteRT-LM. Both run on the phone, no cloud, no account.
End-to-end on the CE 5, a typical 10-15 second voice note takes about 12-15s: Whisper does transcription in ~5s, Gemma categorizes in ~8-10s, and the rest is model load, Room writes, and the UI hop.
At search time (for example, "what did I say about the dentist last week?") it does query expansion, rewriting the user's question into keywords plus hypothetical example items before retrieval. Multiple FTS lanes get merged with reciprocal rank fusion (RRF), then there's an optional Gemma reranker pass over the top-K with a 15s timeout, falling back to the RRF order if it doesn't finish.
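RRF itself is tiny: each lane contributes 1/(k + rank) per item and you sum across lanes. Sketch of the merge (k=60 is the common default; the lane names and note ids are made up):

```python
def rrf_merge(lanes, k=60):
    """lanes: dict of lane name -> ranked list of item ids.
    Returns ids sorted by reciprocal rank fusion score."""
    scores = {}
    for ranking in lanes.values():
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lanes = {
    "keywords": ["note_7", "note_2", "note_9"],
    "hypothetical_items": ["note_2", "note_5"],
}
print(rrf_merge(lanes))  # note_2 wins: it ranks in both lanes
```

The reranker fallback is then just: kick off the Gemma pass, and if the 15s timer fires first, return this RRF order unchanged.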
Curious what people here are doing with local LLMs on their phones lately. Any other good models worth trying for on-device use?
If anyone wants to try it on their own device and share feedback, happy to share it. Mostly looking to know whether the categorization holds up on real notes, and any weirdness on first model load.