Why are we actually sampling reasoning and output the same way?

Reddit r/LocalLLaMA / 4/24/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author observed that when prompting LLMs to reason in English while generating output in other languages, a single fixed sampling setup can cause grammar errors or low-quality text.
  • Lowering temperature helps output quality but can also make the English reasoning less creative and more deterministic.
  • They propose switching samplers mid-generation so “thinking” tokens use one set of sampling parameters (e.g., low temp) while the final output tokens use another set, effectively mimicking two API calls in a single run.
  • Implementing the idea in llama.cpp via parameters like “thinking_sampler_override,” “thinking_top_k,” “thinking_temp,” and “thinking_min_p” reportedly works quickly and produces different grammar/variation tradeoffs.
  • The post speculates that more advanced or complex samplers could behave differently and suggests that runtime control over sampler behavior might be useful.

I've started to notice that my usual setup doesn't work as well in other languages as it did in English - the model sometimes made grammar mistakes and generated genuine garbage. Its reasoning stayed in English, and I preferred to leave it that way, since English is the language most LLMs are obviously most 'confident' in.

The answer to some of the problems of generating in a less-trained language was using a lower temp. But that also influences the reasoning, which is in English, and makes creative writing less 'creative'. Regenerating from the same context became deterministic.
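To see why temp 0 makes regeneration deterministic: temperature divides the logits before the softmax, and as it goes to zero the distribution collapses onto the single most likely token. A minimal sketch (toy logits, not from any real model):

```python
import math

def softmax_with_temperature(logits, temp):
    """Scale logits by 1/temp before softmax; temp -> 0 approaches argmax."""
    if temp == 0.0:
        # Greedy case: all probability mass on the argmax token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # spread-out distribution
print(softmax_with_temperature(logits, 0.0))  # [1.0, 0.0, 0.0] -> deterministic
```

Lower (non-zero) temperatures concentrate probability on the top token without fully eliminating variation, which is why they clean up grammar but dull creativity.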

So that gave me an idea - what if samplers swapped mid-generation, based on the previously generated token? Basically the same as doing two API calls: one for thinking with one sampler preset, and the next (with the thinking in context) with another sampler preset. But instead of doing it by hand, you just write a check in code.
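The switch itself is just a flag flipped on the thinking-tag boundary. Here's a minimal Python sketch of the idea, assuming the model wraps its reasoning in `<think>`...`</think>` tokens; `sample_next` is a stand-in for the real model plus sampler, and in llama.cpp this check would live in the C++ sampling loop:

```python
# Two hypothetical presets: creative for thinking, conservative for output.
THINKING_PRESET = {"temp": 1.0, "top_k": 128, "min_p": 0.05}
OUTPUT_PRESET = {"temp": 0.0, "top_k": 40, "min_p": 0.1}

def generate(sample_next, max_tokens=256):
    tokens, in_thinking = [], False
    for _ in range(max_tokens):
        # Pick the preset based on whether we are inside the thinking block.
        preset = THINKING_PRESET if in_thinking else OUTPUT_PRESET
        tok = sample_next(tokens, preset)
        if tok == "<eos>":
            break
        tokens.append(tok)
        # Swap presets when crossing a thinking-tag boundary.
        if tok == "<think>":
            in_thinking = True
        elif tok == "</think>":
            in_thinking = False
    return tokens

# Scripted stand-in model, just to show which preset each token gets.
script = iter(["<think>", "plan", "</think>", "answer", "<eos>"])
used = []
def stub(tokens, preset):
    tok = next(script)
    used.append((tok, preset["temp"]))
    return tok

print(generate(stub))  # ['<think>', 'plan', '</think>', 'answer']
print(used)            # 'plan' sampled at temp 1.0, 'answer' at temp 0.0
```

One subtlety this sketch shares with any boundary check: the `<think>` tag itself is sampled under the preset active *before* the switch, since you only know you've crossed the boundary after sampling it.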

So I pulled the llama.cpp repository and (kinda) implemented it with a few lines from Claude. The concept is hacky and very simple; you'd need to pass a few additional API arguments:

"thinking_sampler_override": true,
"thinking_top_k": 128,
"thinking_temp": 0.0,
"thinking_min_p": 0.05,

llama.cpp then 'ignores' every other sampler you have set and samples everything between the thinking tokens with only these settings. Surprisingly, it worked almost right off the bat and produced some weird results. For example, on Gemma 4:

  • temp 1 for thinking + temp 0.0 for output: best grammar in Ukrainian so far; random and non-deterministic compared to temp 0 for everything
  • temp 0 for thinking + temp 1 for output: also varies between generations; grammar is still a bit noisy, but probably nice for writing in English(?)

That also makes me wonder how other, more complex samplers would react to this. Unfortunately, I don't have much time or knowledge in this area, so I can only comment on what I experienced.

Edit: Not saying this is anything, but perhaps having more control over samplers at runtime could be beneficial, instead of tweaking them before each generation?

submitted by /u/ReporterWeary9721