AI Navigate

Llama.cpp auto-tuning optimization script

Reddit r/LocalLLaMA / 3/11/2026

📰 News · Tools & Practical Usage

Key Points

  • A new auto-tuning script for llama.cpp and its fork ik_llama.cpp has been created to maximize token throughput (tokens per second) on mixed GPU setups, such as a 3090 Ti + 4070 + 3060 combination.
  • The script removes the need for manual flag configuration and helps avoid out-of-memory (OOM) crashes, enhancing stability and ease of use.
  • The tool is available on GitHub, providing users a practical solution to maximize performance of LLaMA models on heterogeneous hardware systems.
  • This optimization is particularly useful for users running llama.cpp on local or personal multi-GPU setups where manual tuning is complex.
  • The solution reflects ongoing community-driven efforts to improve accessibility and performance of local LLaMA model deployments.
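For context on what "manual flag configuration" means here: llama.cpp's server binary exposes flags for distributing a model across several GPUs, and on mismatched cards the right values are not obvious. The flag names below are from stock llama.cpp; the values are illustrative guesses for a 3090 Ti + 4070 + 3060 box, not output of the script described in the post.

```shell
# Manual multi-GPU tuning in llama.cpp (the kind of thing the script automates).
# Values below are hypothetical examples, not recommendations.
./llama-server \
  -m ./models/model.gguf \
  --n-gpu-layers 99 \
  --tensor-split 24,12,12 \
  --ctx-size 8192
# --n-gpu-layers: how many layers to offload to GPUs (too many -> OOM)
# --tensor-split: weight split ratio across GPUs, e.g. by VRAM in GB
# --ctx-size:     context length; larger contexts need more VRAM
```

Getting these wrong on heterogeneous cards typically means either an out-of-memory crash or idle VRAM on the larger GPU, which is the trial-and-error loop the script is meant to replace.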

I created an auto-tuning script for llama.cpp/ik_llama.cpp that gets you the max tokens per second on weird setups like mine: 3090 Ti + 4070 + 3060.

No more manual flag configuration, no more OOM crashes. Yay!

https://github.com/raketenkater/llm-server
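The post doesn't describe how the tuning search actually works; a common pattern for this kind of tool is to benchmark candidate configurations and keep the fastest one that doesn't crash with out-of-memory. A minimal sketch of that idea, with a hypothetical `run_benchmark` callback standing in for a real llama-bench run:

```python
# Hypothetical auto-tune loop: try GPU-offload layer counts from highest
# to lowest, skip any that OOM, keep the fastest surviving config.
# This is an illustration of the general technique, not the linked script.
def autotune(layer_counts, run_benchmark):
    """run_benchmark(n_layers) returns tokens/sec, or None on an OOM crash."""
    best = None  # (tokens_per_sec, n_layers)
    for n in sorted(layer_counts, reverse=True):
        tps = run_benchmark(n)  # e.g. launch a short llama.cpp test run with -ngl n
        if tps is None:
            continue  # OOM: this many offloaded layers doesn't fit
        if best is None or tps > best[0]:
            best = (tps, n)
    return best
```

A real implementation would also search over tensor-split ratios and context size, but the OOM-skip structure stays the same.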

submitted by /u/raketenkater