The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)

Reddit r/LocalLLaMA / 4/14/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The project llm-server v2 introduces a new option, --ai-tune, which uses an LLM-driven loop to automatically tune llama.cpp flags and caches the best-performing configuration.
  • Benchmarks reported for Qwen3.5-27B Q4_K_M show a large throughput increase: ~40 tok/s, up 54% over llm-server v1's tuning and more than double the baseline llama-server configuration.
  • The tuning system is designed to stay compatible with ongoing llama.cpp/ik_llama.cpp changes by feeding llama-server --help output into the LLM as context, so new flags can be adopted without manual updates.
  • The author claims the approach also improves stability and adds a more polished operator experience via a TUI/GUI (llm-server-gui).
  • The work is shared as an update to an open-source repository, encouraging others to test and adopt the autotuning workflow for local LLM inference speed gains.

This is V2 of my previous post.

What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.

My weird rig: 3090 Ti + 4070 + 3060 + 128GB RAM.

| Model | llama-server | llm-server v1 tuning | llm-server v2 (ai-tuning) |
|---|---|---|---|
| Qwen3.5-122B | 4.1 tok/s | 11.2 tok/s | 17.47 tok/s |
| Qwen3.5-27B Q4_K_M | 18.5 tok/s | 25.94 tok/s | 40.05 tok/s |
| gemma-4-31B UD-Q4_K_XL | 14.2 tok/s | 23.17 tok/s | 24.77 tok/s |

What I think is best here: --ai-tune keeps up with updates on llama.cpp / ik_llama.cpp automatically, because it feeds llama-server --help into the LLM tuning loop as context. New flags land → the tuner can use them → you get the best performance.
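For readers curious what such a loop could look like: the repo's internals aren't shown in the post, so this is only a minimal sketch of the described idea — scrape candidate flags from `llama-server --help`, let an LLM propose a flag set given past results, benchmark it, and keep the fastest config. The names `extract_flags`, `propose`, and `benchmark` are hypothetical, not the project's actual API.

```python
import re

def extract_flags(help_text: str) -> list[str]:
    """Pull long-form flags (e.g. --n-gpu-layers) out of `llama-server --help`
    output, so newly added flags become available to the tuner automatically."""
    return sorted(set(re.findall(r"--[a-z][a-z0-9-]+", help_text)))

def tune(propose, benchmark, rounds: int = 5):
    """LLM-in-the-loop tuner: `propose(history)` asks the LLM for a flag set
    given past (config, tok/s) results; `benchmark(cfg)` would launch
    llama-server with those flags and measure throughput. The fastest
    configuration seen so far is kept (and could be cached to disk)."""
    best_cfg, best_tps, history = None, 0.0, []
    for _ in range(rounds):
        cfg = propose(history)   # hypothetical: LLM call with --help text + history as context
        tps = benchmark(cfg)     # hypothetical: real run of llama-server, measured tok/s
        history.append((cfg, tps))
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
    return best_cfg, best_tps
```

Feeding the current `--help` text into `propose` as context is what keeps the loop compatible with upstream llama.cpp / ik_llama.cpp changes, as the post describes: new flags appear in the help output and the tuner can try them without code changes.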

I think those are some solid gains (max tokens, yeaaahh), plus more stability and a nice TUI via llm-server-gui.

Check it out: https://github.com/raketenkater/llm-server

submitted by /u/raketenkater