
Graceful reasoning budget termination for qwen3.5 models in llama.cpp

Reddit r/LocalLLaMA / 3/16/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • A fix was implemented to stop the reasoning budget from acting as a hard cutoff and causing abrupt mid-sentence terminations in qwen3.5 models running under llama.cpp.
  • The solution uses a prompt injection flag to insert a final "Final Answer: Based on my analysis above" after a budget threshold, enabling the model to finish with a concise summary.
  • The remaining budget can be large, allowing the model to continue reasoning for minutes and then wrap up quickly once the summary is produced; tests were conducted on qwen3.5 27b, 35ba3b, and 9b.
  • The approach may incur performance degradation and is not the most graceful variant, but it achieves a graceful termination in the tested scenarios.
  • The author did not submit a PR due to llama.cpp guidelines about AI code, but invites others to review or contribute the code.

I fixed the issue where the reasoning budget was just a hard cutoff and the model dropped the mic mid-sentence. This is not the most graceful way to do it, and there is possibly some performance degradation too. But without a stop, the model just reasons for minutes.

I found that when, after some budget, a sentence is injected like:

"Final Answer:\nBased on my analysis above, "

The model keeps writing as if it were its own idea and then finishes up gracefully with a summary.

I implemented this with a prompt injection flag. For example, inject after 300 tokens and leave a rest budget for the summary. The rest budget can be a lot, like a few thousand tokens, but in my tests the model finishes up quickly after the injection.

I did not make a pull request since "I" wrote this code with Claude Code. It worked as planned, but the llama.cpp rules state that no AI code is permitted in a PR, and I don't want to overwhelm the maintainers with AI code. So I'd rather post my insights.

If someone wants to review the code and make a PR, feel free; I am happy to share the code.

Cheers.

Tested successfully on qwen3.5 27b, 35ba3b and 9b.

Issue on github: https://github.com/ggml-org/llama.cpp/issues/20632

submitted by /u/marinetankguy2