Llama.cpp now with a true reasoning budget!

Reddit r/LocalLLaMA / 3/12/2026



Key Points

llama.cpp now supports a real reasoning budget by counting tokens and terminating reasoning when the budget is reached, implemented via the sampler mechanism.
A new flag --reasoning-budget-message inserts a message just before the end of reasoning to ease the transition when the budget is exceeded.
Early tests show enforcing a reasoning budget can significantly hurt performance (e.g., Qwen3 9B HumanEval scores dropping from 94% to 78%), but the messaging option can recover scores to around 89% with a 1000-token budget.
Users are encouraged to experiment with different models and settings; limiting reasoning on models that think strongly by default (like StepFun 3.5) with --reasoning-budget 0 can lead to erratic behavior such as attempting to open a second reasoning block.


I'm happy to report that llama.cpp has another nice and exciting feature that I know a lot of you have been waiting for - real support for reasoning budgets!

Until now, `--reasoning-budget` was basically a stub: its only function was that setting it to 0 disabled thinking by passing `enable_thinking=false` to templates. Now there is a real reasoning budget setting, implemented via the sampler mechanism. Once reasoning starts, we count the tokens, and when the given number of reasoning tokens is reached, we force the reasoning to end.
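To make the counting idea concrete, here is a minimal toy sketch in Python. It is not the llama.cpp implementation: the `<think>` / `</think>` markers, the `BudgetSampler` class, and `toy_model` are all hypothetical names chosen for illustration, and real samplers work on token IDs and logits rather than strings.

```python
from typing import Callable, List

THINK_OPEN = "<think>"    # assumed reasoning-start marker (varies by model)
THINK_CLOSE = "</think>"  # assumed reasoning-end marker (varies by model)
EOS = "<eos>"             # assumed end-of-sequence marker


class BudgetSampler:
    """Toy sampler step: overrides the model's proposal once the budget is spent."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.in_reasoning = False
        self.used = 0  # reasoning tokens emitted so far

    def pick(self, proposed: str) -> str:
        # Budget spent and the model wants to keep thinking: force the end marker.
        if self.in_reasoning and proposed != THINK_CLOSE and self.used >= self.budget:
            proposed = THINK_CLOSE
        if proposed == THINK_OPEN:
            self.in_reasoning, self.used = True, 0
        elif proposed == THINK_CLOSE:
            self.in_reasoning = False
        elif self.in_reasoning:
            self.used += 1
        return proposed


def generate(model: Callable[[List[str]], str], sampler: BudgetSampler,
             max_tokens: int = 64) -> List[str]:
    """Run the toy generation loop, letting the sampler veto each proposal."""
    out: List[str] = []
    for _ in range(max_tokens):
        tok = sampler.pick(model(out))
        if tok == EOS:
            break
        out.append(tok)
    return out


def toy_model(context: List[str]) -> str:
    """Hypothetical model: thinks for 5 tokens unless cut off, then answers."""
    if not context:
        return THINK_OPEN
    if THINK_CLOSE in context:
        return "answer" if context[-1] == THINK_CLOSE else EOS
    n_thoughts = len(context) - 1  # tokens after the opening marker
    return f"t{n_thoughts}" if n_thoughts < 5 else THINK_CLOSE


print(generate(toy_model, BudgetSampler(budget=2)))
```

Because the forced end marker becomes part of the context, the toy model continues from a closed reasoning block, which is the same abrupt transition the real feature has to deal with.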

However, doing this "just like that" might not have a good effect on the model. In fact, when I did that on Qwen3 9B (tested on HumanEval), its performance cratered: from 94% in the reasoning version and 88% in the non-reasoning version to a terrible 78% with an enforced reasoning budget. That's why we've added another flag: `--reasoning-budget-message`. This inserts a message right before the end of reasoning to ease the transition. When I used the message "... thinking budget exceeded, let's answer now.", the score bounced back and the returns from partial reasoning started to show, though they were not very large: I got a HumanEval score of 89% with a reasoning budget of 1000.
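The message mechanism can be pictured as a queue of forced tokens: once the budget runs out, the message is emitted first and the end-of-reasoning marker only after it. The sketch below is a toy illustration under assumptions, not llama.cpp's code. The `<think>` / `</think>` markers and the `BudgetWithMessage` class are hypothetical, the message is pre-split into string tokens, and whether the real implementation counts message tokens against the budget is not stated in the post (here they are not counted).

```python
from collections import deque
from typing import Deque, List

THINK_OPEN = "<think>"    # assumed reasoning-start marker
THINK_CLOSE = "</think>"  # assumed reasoning-end marker


class BudgetWithMessage:
    """Toy sampler step that forces `message` and then the end marker at the budget."""

    def __init__(self, budget: int, message: List[str]) -> None:
        self.budget = budget
        self.message = message
        self.in_reasoning = False
        self.used = 0
        self.pending: Deque[str] = deque()  # tokens still to be forced

    def pick(self, proposed: str) -> str:
        over = self.in_reasoning and proposed != THINK_CLOSE and self.used >= self.budget
        if over and not self.pending:
            # Queue the transition message, then the end-of-reasoning marker.
            self.pending.extend([*self.message, THINK_CLOSE])
        forced = bool(self.pending)
        if forced:
            proposed = self.pending.popleft()  # ignore the model's proposal
        if proposed == THINK_OPEN:
            self.in_reasoning, self.used = True, 0
        elif proposed == THINK_CLOSE:
            self.in_reasoning = False
        elif self.in_reasoning and not forced:
            self.used += 1  # only the model's own reasoning tokens are counted
        return proposed


# Feed a fixed list of model proposals through the sampler (budget of 2 tokens).
sampler = BudgetWithMessage(budget=2, message=["...", "budget", "exceeded"])
proposals = ["<think>", "a", "b", "c", "d", "e", "f"]
print([sampler.pick(p) for p in proposals])
```

In this toy run the model's own proposals `c` through `f` are discarded and replaced by the message and the closing marker, so the reasoning block ends on a sentence that signals the switch to answering instead of stopping mid-thought.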

I invite you to experiment with the feature; maybe you can find some nice settings for different models. You can even force models that think strongly by default (e.g. StepFun 3.5) to limit reasoning, though with those models `--reasoning-budget 0` (which now restricts reasoning to none via the sampler, not via the template) results in some pretty erratic and bad behavior (for example, they try to open a second reasoning block).
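When experimenting across models, a quick check for the "second reasoning block" failure can help flag bad runs automatically. The helper below is a hypothetical sketch that assumes the raw output text still contains the reasoning markers and that they are `<think>` and `</think>`; other models use different tags, so pass the right ones in.

```python
def opens_second_reasoning_block(output: str,
                                 open_tag: str = "<think>",
                                 close_tag: str = "</think>") -> bool:
    """Return True if a reasoning block is opened again after the first one closed."""
    first_close = output.find(close_tag)
    if first_close == -1:
        return False  # reasoning never closed, so no second block to detect
    rest = output[first_close + len(close_tag):]
    return open_tag in rest


print(opens_second_reasoning_block("<think></think><think>wait, let me think</think>42"))
print(opens_second_reasoning_block("<think>short thought</think>42"))
```

Counting how often this returns True over a batch of prompts gives a rough signal of how badly a given model tolerates a forced budget.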

submitted by /u/ilintar