Running SmolLM2‑360M on a Samsung Galaxy Watch 4 (380MB RAM) – 74% RAM reduction in llama.cpp

Reddit r/LocalLLaMA / 4/2/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author reports successfully running SmolLM2-360M on a Samsung Galaxy Watch 4 Classic with ~380MB available RAM by modifying llama.cpp/ggml memory handling.
  • They identify the main bottleneck as the model being effectively loaded twice (once via the APK mmap page cache and again via ggml tensor allocations), causing peak RAM to rise to ~524MB.
  • By passing host_ptr into llama_model_params, CPU tensors directly reference the mmap region while only Vulkan tensors are copied, avoiding redundant RAM usage.
  • On the watch, the change reduces peak RAM from ~524MB to ~142MB (about a 74% reduction) and improves boot time (19s → 11s, with ~2.5s on subsequent boots after mmap/KV cache warm-up).
  • The author shares code and proposes a PR to ggml-org/llama.cpp, seeking feedback on the host_ptr/mmap approach for embedded deployment.

I’ve got SmolLM2‑360M running on a Samsung Galaxy Watch 4 Classic (about 380MB free RAM) by tweaking llama.cpp and the underlying ggml memory model. By default, the model was being loaded twice in RAM: once via the APK’s mmap page cache and again via ggml’s tensor allocations, peaking at 524MB for a 270MB model.

The fix: I pass host_ptr into llama_model_params, so CPU tensors point directly into the mmap region and only Vulkan tensors are copied. On real hardware this gives:

  • Peak RAM: 524MB → 142MB (74% reduction)
  • First boot: 19s → 11s
  • Second boot: ~2.5s (mmap + KV cache warm)

Code:
https://github.com/Perinban/llama.cpp/tree/axon-dev

Longer write‑up with VmRSS traces and design notes:
https://www.linkedin.com/posts/perinban-parameshwaran_machinelearning-llm-embeddedai-activity-7445374117987373056-xDj9?utm_source=share&utm_medium=member_desktop&rcm=ACoAAA1J2KoBHgKFnrEIUchmbOoZTpAqKKxKK7o

I’m planning a PR to ggml-org/llama.cpp; feedback on the host_ptr / mmap pattern is welcome.

submitted by /u/RecognitionFlat1470