(llama.cpp) Possible to disable reasoning for some requests (while leaving reasoning on by default)?

Reddit r/LocalLLaMA / 4/15/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post asks whether llama.cpp (via llama-server) can selectively disable “reasoning” for certain requests while keeping it enabled by default for others.
  • The motivation is latency: the user wants faster responses for interactive chat use cases without sacrificing reasoning in other scenarios.
  • It specifically references running a GGUF model (gemma-4-26B variant) through llama-server with reasoning enabled.
  • The question seeks configuration or request-level control (e.g., parameters or API flags) to toggle reasoning behavior per call.
  • Overall, the discussion is about practical performance tuning for local LLM serving rather than a new model or release event.

I am running unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf with llama-server (with reasoning enabled).

Is it possible to disable reasoning for some requests only? If yes, how?

I want to leave reasoning on by default, but for some other use cases I want it to respond as fast as possible (e.g., a chat bot).
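One possible approach, depending on the model's chat template: recent llama.cpp builds let a request pass `chat_template_kwargs` to the OpenAI-compatible `/v1/chat/completions` endpoint, and some templates (Qwen3-style ones, for example) honor an `enable_thinking` flag there. Whether the Gemma template used here supports this is an assumption; a server-wide alternative is starting llama-server with `--reasoning-budget 0`, which disables thinking for all requests. A minimal sketch of the per-request payload:

```python
import json

def build_chat_request(messages, thinking=True):
    """Build a /v1/chat/completions payload, toggling reasoning per call.

    Assumptions (not confirmed for the Gemma template in the post):
      - the model's chat template honors an `enable_thinking` flag,
        as Qwen3-style templates do;
      - llama-server forwards `chat_template_kwargs` from the request
        body into the chat template.
    """
    return {
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

# Fast path for an interactive chat bot: reasoning off for this call only;
# other requests omit the flag (or pass True) and keep the server default.
fast = build_chat_request([{"role": "user", "content": "Hi!"}], thinking=False)
print(json.dumps(fast))
```

The payload would then be POSTed to the running llama-server instance as usual; requests that should keep reasoning simply leave `enable_thinking` at `True`.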

submitted by /u/regunakyle