Running a local LLM on Android with Termux and llama.cpp

Reddit r/LocalLLaMA / 4/6/2026


Key Points

  • The post demonstrates how to run an open-source local LLM on an Android phone using Termux and llama.cpp, specifically using Qwen3.5-0.8B with GGUF quantization.
  • It outlines practical installation and model setup steps, including installing llama-cpp in Termux and opening a downloaded .gguf model file with Termux.
  • Users can interact with the model either via the terminal (llama-cli) or through a local browser UI (llama-server on localhost:8080).
  • Performance tests show that inference throughput (TPS) can improve by tuning parameters like CPU thread count (e.g., using “-t 6” increased TPS from ~3–4 to ~7–8).
  • The author notes that larger/newer models (e.g., an 8B 1-bit GGUF variant) required different setup and were not yet usable due to low TPS, indicating device-dependent tuning needs.

What I used

  • Samsung S21 Ultra
  • Termux
  • llama-cpp-cli
  • llama-cpp-server
  • Qwen3.5-0.8B with Q5_K_M quantization from Hugging Face
  • (I also tried Bonsai-8B-GGUF-1bit from Hugging Face. It is a newer model and required a different setup, which I might write about at a later time, but it produced only 2-3 TPS, which I did not find usable)

Installation

I downloaded the "Termux" app from the Google Play store and installed the needed tools in Termux:

 pkg update && pkg upgrade -y
 pkg install llama-cpp -y

Downloading a model

I downloaded Qwen3.5-0.8B-Q5_K_M.gguf in my phone browser and saved it to my device. Then I opened the download folder shortcut in the browser, selected the GGUF file, and chose "Open with: Termux".

Now the file is accessible in Termux.

Running it in the terminal

After that, I loaded the model and started chatting through the command line.

llama-cli -m /path/to/model.gguf 

Running it in the browser

I also ran the model with llama-server, which gives a more readable UI in your web browser while Termux runs in the background. To do this, run the command below to start a local server, then open it in the browser by typing localhost:8080 or 127.0.0.1:8080 in the address bar.

llama-server -m /path/to/model.gguf 
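Besides the browser UI, llama-server also answers HTTP requests, including an OpenAI-compatible /v1/chat/completions endpoint in recent builds. As a sketch, here is a minimal stdlib-only Python client you could run in the same Termux session while the server is up; the names build_request and ask are mine, and the URL assumes the default port from above.

```python
import json
import urllib.request

# Default address used by llama-server unless overridden with --host/--port
SERVER = "http://127.0.0.1:8080"

def build_request(prompt, n_predict=128):
    """Build a chat payload for the OpenAI-compatible endpoint."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }

def ask(prompt):
    """Send one prompt to the running llama-server and return the reply text."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        SERVER + "/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running: print(ask("Say hello in five words."))
```

This is handy for scripting against the phone-hosted model instead of typing into the web UI.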

With the previous command I achieved only 3-4 TPS; just by adding the parameter "-t 6", which dedicates 6 CPU threads to inference, output increased to 7-8 TPS. This shows there is real potential to increase generation speed by tuning parameters.

llama-server -m /path/to/model.gguf -t 6 
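To put those TPS numbers in perspective, a quick back-of-the-envelope helper (the function names are mine, not part of llama.cpp) shows what the "-t 6" speedup means in waiting time:

```python
def measured_tps(n_tokens, seconds):
    """Tokens per second observed from a timed generation."""
    return n_tokens / seconds

def seconds_for_reply(n_tokens, tps):
    """Rough wait time for a reply of n_tokens at a given TPS rate."""
    return n_tokens / tps

# A 120-token reply takes 30 s at 4 TPS but only 15 s at 8 TPS,
# so doubling throughput roughly halves the wait.
print(seconds_for_reply(120, 4), seconds_for_reply(120, 8))  # 30.0 15.0
```

In other words, going from 3-4 to 7-8 TPS turns a half-minute answer into a usable one.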

Conclusion

Running an open source LLM on my phone like this was a fun experience, especially considering it is a 2021 device; newer phones should offer an even more enjoyable experience.

This is by no means a guide on how to do it best, as I have done only surface-level testing. There are various parameters that can be adjusted, depending on your device, to increase TPS and achieve a better-tuned setup.

Maybe this has motivated you to try it on your own phone, and I hope you found some of this helpful!

submitted by /u/Different_Drive_1095