Running a local LLM on Android with Termux and llama.cpp
Reddit r/LocalLLaMA / 4/6/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage

What I used

Installation
I downloaded the "Termux" app from the Google Play Store and installed the needed tools in Termux.

Downloading a model
I downloaded Qwen3.5-0.8B-Q5_K_M.gguf in my phone browser and saved it to my device. Then I opened the download folder shortcut in the browser, selected the GGUF file, and chose "Open with: Termux". The file is now accessible in Termux.

Running it in the terminal
After that, I loaded the model and started chatting through the command line.

Running it in the browser
I also tried running the model with llama-server, which gives a more readable UI in your web browser while Termux runs in the background. To do this, start a local server with llama-server and open it in the browser by entering localhost:8080 or 127.0.0.1:8080 in the address bar. With the previous command I had achieved only 3-4 TPS; just by adding the parameter "-t 6", which dedicates six CPU threads to inference, output increased to 7-8 TPS. This shows there is potential to increase generation speed with various parameters.

Conclusion
Running an open-source LLM on my phone like this was a fun experience, especially considering it is a 2021 device, so newer phones should offer an even more enjoyable experience. This is by no means a guide on how to do it best, as I have done only surface-level testing. There are various parameters that can be adjusted, depending on your device, to increase TPS and achieve a more optimal setup. Maybe this has motivated you to try this on your phone, and I hope you find some of this helpful!
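The post does not preserve the exact install commands, so the sketch below is an assumption: it uses the prebuilt `llama-cpp` package from the Termux repositories (building llama.cpp from source with `git` and `cmake` is the alternative path).

```shell
# Update Termux package lists and upgrade existing packages
pkg update && pkg upgrade -y

# Install llama.cpp (assumption: Termux ships a prebuilt `llama-cpp`
# package providing the llama-cli and llama-server binaries)
pkg install -y llama-cpp

# Optional: grant Termux access to shared storage (Downloads folder etc.)
termux-setup-storage
```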
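The terminal-chat step could then be a single llama-cli invocation; the model path below is an assumption, since the post doesn't say where the GGUF lands after being opened with Termux.

```shell
# Load the quantized model and start an interactive chat session.
# -m  path to the GGUF file (hypothetical location; adjust to where
#     your downloaded model actually ended up)
llama-cli -m ~/Qwen3.5-0.8B-Q5_K_M.gguf
```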
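For the browser UI, a plausible sketch of the server command, including the thread-count parameter that raised throughput from ~3-4 to ~7-8 TPS in the post; the model path is again an assumption, and host/port are spelled out even though 127.0.0.1:8080 is llama-server's default.

```shell
# Serve the model with llama.cpp's built-in web UI.
# -t 6 dedicates six CPU threads to inference (the tuning from the post).
llama-server -m ~/Qwen3.5-0.8B-Q5_K_M.gguf -t 6 --host 127.0.0.1 --port 8080
# Then browse to http://localhost:8080 (or http://127.0.0.1:8080)
```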
Key Points
- The post demonstrates how to run an open-source local LLM on an Android phone using Termux and llama.cpp, specifically a Q5_K_M-quantized GGUF build of Qwen3.5-0.8B.
- It outlines practical installation and model setup steps, including installing llama-cpp in Termux and opening a downloaded .gguf model file with Termux.
- Users can interact with the model either via the terminal (llama-cli) or through a local browser UI (llama-server on localhost:8080).
- Performance tests show that inference throughput (TPS) can improve by tuning parameters like CPU thread count (e.g., using “-t 6” increased TPS from ~3–4 to ~7–8).
- The author notes that larger/newer models (e.g., an 8B 1-bit GGUF variant) required different setup and were not yet usable due to low TPS, indicating device-dependent tuning needs.




