I fixed the issue of the reasoning budget being just a hard cutoff, where the model drops the mic mid-sentence. My approach is not the most graceful way to do it, and there may be some performance degradation, but without a stop the model just reasons for minutes.
I found that if, once some budget is used up, a sentence like this is injected:
"Final Answer:\nBased on my analysis above, "
the model keeps writing as if it were its own idea and then finishes up gracefully with a summary.
I implemented this with a prompt injection flag: for example, inject after 300 reasoning tokens and leave a rest budget for the summary. The rest budget can be generous, a few thousand tokens, and in my tests the model finishes up quickly well within it.
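To make the idea concrete, here is a minimal Python sketch of the two-phase loop, not the actual llama.cpp patch. The names `REASONING_BUDGET`, `SUMMARY_BUDGET`, and `sample_token` are illustrative stand-ins (the real sampler lives in llama.cpp's C++ generation loop); the point is just the control flow: count reasoning tokens as they stream out, splice in the handoff sentence when the budget is exhausted, then continue with a separate summary budget.

```python
REASONING_BUDGET = 300   # tokens of free-form reasoning allowed
SUMMARY_BUDGET = 2048    # generous rest budget for the wrap-up
INJECTION = "Final Answer:\nBased on my analysis above, "
EOS = "<eos>"            # sentinel for end-of-sequence


def generate(sample_token):
    """Two-phase generation: reason up to the budget, then inject and summarize.

    `sample_token` is a stand-in for the real sampler: it takes the tokens
    generated so far and returns the next token (or EOS).
    """
    out = []
    # Phase 1: normal reasoning, capped at REASONING_BUDGET tokens.
    for _ in range(REASONING_BUDGET):
        tok = sample_token(out)
        if tok == EOS:
            return "".join(out)  # model finished on its own, no injection needed
        out.append(tok)
    # Phase 2: inject the handoff so the model continues as if it had
    # decided to summarize, then let it finish within the rest budget.
    out.append(INJECTION)
    for _ in range(SUMMARY_BUDGET):
        tok = sample_token(out)
        if tok == EOS:
            break
        out.append(tok)
    return "".join(out)


# Toy sampler mimicking the observed behavior: reasons forever until it
# sees the injected handoff, then wraps up with a short summary.
def toy_sampler(context):
    text = "".join(context)
    if "Final Answer:" in text:
        after = text.split("Final Answer:", 1)[1]
        return EOS if "everything checks out." in after else "everything checks out."
    return "think "
```

With `toy_sampler`, the output contains exactly 300 reasoning tokens followed by the injected sentence and a short summary; swap in a real sampler to apply the same loop to an actual model.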
I did not make a pull request, since "I" wrote this code with Claude Code. It worked as planned, but the llama.cpp contribution rules state that no AI-generated code is permitted in a PR, and I don't want to overwhelm the maintainers with AI code. So I'd rather post my insights.
If someone wants to review the code and make a PR, feel free; I am happy to share it.
Cheers.
Tested successfully on qwen3.5 27b, 35ba3b, and 9b.
Issue on github: https://github.com/ggml-org/llama.cpp/issues/20632




