Gemma 4 E2B runs surprisingly well on my 8GB Android phone, so I built a private voice notes app around it.

Reddit r/LocalLLaMA / 5/4/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author reports running Gemma 4 E2B locally on an 8GB Android phone and finding chat quality acceptable while being especially impressed by reliably structured, parseable JSON output.
  • Based on that behavior, they built a private Android voice-notes app that transcribes speech with Whisper Small and uses Gemma to split rambling notes into separate, tagged reminder items with resolved timing.
  • They describe end-to-end latency for a 10–15 second voice note as roughly 12–15 seconds total, with transcription around ~5 seconds and categorization/splitting around ~8–10 seconds, plus overhead for model loading, storage, and UI updates.
  • For searching, the app expands and rewrites user queries into keyword/hypothetical examples, merges multiple FTS retrieval lanes via reciprocal rank fusion, and optionally reranks top results with a reranker timeout.
  • The post invites others to share what local LLM models they run on phones and asks specifically whether categorization remains robust on real-world notes and how first-run behavior differs across devices.

Been running Gemma 4 E2B locally on my OnePlus CE 5 (8GB RAM) for a few months. Chat quality is fine for the size. What surprised me was JSON output. Give it a short input and a structured prompt and you get clean, parseable JSON back. Way better than I expected from a 2.4GB model on a phone.

Got me thinking about voice notes. You ramble for a few seconds, "call the dentist tomorrow at 3, also buy milk on the way home", and Gemma can split that into separate items, tag each one (reminder, buy), resolve the time. Tried it for a few weeks. Categorization is actually decent on real notes, not just the toy ones I started with.
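The splitting step can be sketched roughly like this. The prompt wording, the tag set, and the parsing helper are my assumptions for illustration, not the author's actual code:

```python
import json

# Hypothetical prompt/schema -- the post doesn't show the real one.
SPLIT_PROMPT = """Split the note into separate items.
Return ONLY a JSON array. Each item has "text", "tag" (reminder|buy|note),
and an optional "time" (ISO 8601).
Note: {note}"""

def parse_items(raw: str) -> list[dict]:
    """Parse the model's JSON output, tolerating stray text around the array."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        return []  # no array found: fall back to keeping the note as one item
    try:
        items = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only well-formed items that at least carry text.
    return [i for i in items if isinstance(i, dict) and "text" in i]

# The kind of output the post describes for
# "call the dentist tomorrow at 3, also buy milk on the way home":
raw = (
    '[{"text": "call the dentist", "tag": "reminder", "time": "2026-05-05T15:00"},'
    ' {"text": "buy milk on the way home", "tag": "buy"}]'
)
```

The defensive bracket-trimming matters because even models that usually emit clean JSON occasionally wrap it in prose.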

Built an Android app around it. Whisper Small (244MB) for transcription via Sherpa-ONNX, Gemma 4 E2B (2.4GB) for the splitting and categorization via LiteRT-LM. Both run on the phone, no cloud, no account.

End-to-end on the CE 5, a typical 10-15 second voice note takes about 12-15s. Whisper does transcription in ~5s, Gemma categorizes in ~8-10s, rest is model load + Room writes + UI hop.

At search time (for example, "what did I say about the dentist last week") it does query expansion, rewriting the user's question into keywords plus hypothetical example items before retrieval. Multiple FTS lanes get merged with reciprocal rank fusion, then there's an optional Gemma reranker pass over the top-K with a 15s timeout and fallback to RRF order if it doesn't finish.
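The reciprocal rank fusion merge is standard and easy to sketch; the lane contents here are made-up note IDs, and `k = 60` is the commonly used constant, not necessarily what the app uses:

```python
from collections import defaultdict

def rrf_merge(lanes: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (one per FTS lane) with reciprocal rank fusion:
    score(doc) = sum over lanes of 1 / (k + rank), then sort by score."""
    scores: defaultdict[str, float] = defaultdict(float)
    for lane in lanes:
        for rank, doc_id in enumerate(lane, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Lane 1: FTS over the user's raw keywords; lane 2: FTS over expanded terms.
fused = rrf_merge([["note2", "note7", "note9"], ["note2", "note4"]])
```

RRF is a good fit here because the lanes return BM25-style scores on different scales; fusing on ranks instead of raw scores sidesteps any cross-lane normalization.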

Curious what people here are doing with local LLMs on their phones lately. Any other good models worth trying for on-device use?
If anyone wants to try it on their own device and share feedback, happy to share it. Mostly looking to know if the categorization holds up on real notes, and whether there's any weirdness on first model load across devices.

submitted by /u/Effective-Drawer9152