I’ve got SmolLM2‑360M running on a Samsung Galaxy Watch 4 Classic (about 380MB free RAM) by tweaking llama.cpp and the underlying ggml memory model. By default, the model was being loaded twice in RAM: once via the APK’s mmap page cache and again via ggml’s tensor allocations, peaking at 524MB for a 270MB model.
The fix: I pass host_ptr into llama_model_params, so CPU tensors point directly into the mmap region and only Vulkan tensors are copied. On real hardware this gives:
- Peak RAM: 524MB → 142MB (74% reduction)
- First boot: 19s → 11s
- Second boot: ~2.5s (mmap + KV cache warm)
Code:
https://github.com/Perinban/llama.cpp/tree/axon‑dev
Longer write‑up with VmRSS traces and design notes:
https://www.linkedin.com/posts/perinban-parameshwaran_machinelearning-llm-embeddedai-activity-7445374117987373056-xDj9?utm_source=share&utm_medium=member_desktop&rcm=ACoAAA1J2KoBHgKFnrEIUchmbOoZTpAqKKxKK7o
I’m planning a PR to ggml‑org/llama.cpp; feedback on the host‑ptr / mmap pattern is welcome.
[link] [comments]




