Pi & Qwen3.5 with llama-cpp doing a lot of prompt re-processing

Reddit r/LocalLLaMA / 4/13/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The author reports that when using Pi with llama-cpp and a Qwen3.5 122B model (thinking enabled), agentic edits involving interleaved thinking and tool calls work correctly during the edits themselves.
  • On the subsequent user turn, llama-cpp logs show the “thinking” blocks from the prior context are dropped, causing the context cache to be invalidated and recomputed from the divergence point.
  • The issue appears in llama-cpp as a prompt/slot update where the new prompt diverges by not including the previous thinking block, leading to cache checkpoint invalidations.
  • The author asks whether this behavior is expected for Pi/llama-cpp prompt re-processing, or whether Pi is misconfigured or buggy in omitting prior thinking context.
  • They include their Pi `models.json` configuration, showing `reasoning: true` and a specified `compat.thinkingFormat` of `qwen-chat-template`, which they suspect may relate to the missing thinking blocks.

I've noticed a problem when using Pi as a coding agent with llama-cpp, and I'm wondering whether Pi has a bug, whether I have it misconfigured, or whether this is just expected behavior.

I'm using Qwen3.5 122B with thinking enabled. When doing a bunch of agentic edits, it interleaves a lot of thinking and tool calls. This all works fine.

But when I provide input on my next turn, a large chunk of the context cache gets invalidated, because Pi appears to stop sending the thinking blocks from earlier turns. I can see this in the llama-cpp log, where the new prompt diverges from the cached one by dropping the thinking block:

```
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.736 (> 0.100 thold), f_keep = 0.703
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 29044 | processing task, is_child = 0
slot update_slots: id  3 | task 29044 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 48112
slot update_slots: id  3 | task 29044 | old: ... <|im_start|>assistant | <think> The user is saying
slot update_slots: id  3 | task 29044 | new: ... <|im_start|>assistant | You're right - ball-to
slot update_slots: id  3 | task 29044 |      198 248045 74455 198 248068 198 760 1156 369 5315
slot update_slots: id  3 | task 29044 |      198 248045 74455 198 2523 2224 1245 471 4776 4534
slot update_slots: id  3 | task 29044 | n_past = 35407, slot.prompt.tokens.size() = 50377, seq_id = 3, pos_min = 50376, n_swa = 0
```

And then it goes on to invalidate a number of context checkpoints and recompute the cache from the point where the history diverged, i.e. where the thinking context was dropped.
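To make the recomputation concrete: a server can only reuse the KV cache up to the first token where the old and new prompts differ. Here's a toy sketch of that prefix check (the token IDs are made up, and real slot handling in llama-cpp is more involved, with LCP-similarity slot selection and checkpoints):

```python
# Toy illustration of prefix-based KV-cache reuse: the server keeps only the
# cache entries up to the first token where the cached and new prompts differ.
# Token IDs are invented for illustration.

def reusable_prefix(cached_tokens, new_tokens):
    """Length of the shared token prefix (roughly what n_past is reset to)."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 100, 101, 102, 4, 5]  # ... <think> tokens ... final answer
new = [1, 2, 3, 4, 5]                    # same turn, thinking tokens dropped
print(reusable_prefix(cached, new))      # -> 3: everything after is recomputed
```

So even though everything after the dropped thinking block is identical text, it sits at different positions and has to be re-processed.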

Now, I haven't dug into this too deeply yet, but I wanted to check: is this behavior expected? Do I have something configured wrong, or is Pi buggy in not sending thinking context from previous turns?
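My working theory is that the serialized prompt changes between turns because prior `<think>` blocks get stripped when the history is re-serialized. A minimal sketch of that mechanism (the message contents, tag handling, and stripping rule are illustrative assumptions, not Pi's or the Qwen chat template's actual code):

```python
import os

# Sketch: re-serializing chat history with vs. without prior <think> blocks.
# Contents and stripping logic are hypothetical, for illustration only.

def render(messages, keep_thinking):
    parts = []
    for m in messages:
        content = m["content"]
        if m["role"] == "assistant" and not keep_thinking:
            # Many chat templates drop reasoning from previous assistant turns.
            start, end = content.find("<think>"), content.find("</think>")
            if start != -1 and end != -1:
                content = (content[:start] + content[end + len("</think>"):]).lstrip()
        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>\n")
    return "".join(parts)

history = [
    {"role": "user", "content": "edit the file"},
    {"role": "assistant",
     "content": "<think>The user is saying...</think>You're right - ..."},
]

sent_during_turn = render(history, keep_thinking=True)   # what got cached
sent_next_turn = render(history, keep_thinking=False)    # thinking stripped

# The shared prefix ends right after "<|im_start|>assistant\n", so the KV
# cache for everything from the <think> block onward can't be reused.
shared = os.path.commonprefix([sent_during_turn, sent_next_turn])
print(len(shared), len(sent_during_turn))
```

If that's what's happening, the question becomes whether Pi should be preserving the thinking blocks in history for this model, or whether stripping them is the intended behavior for `qwen-chat-template`.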

Here's the model config from my models.json in my Pi config:

 { "id": "unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL", "name": "Qwen3.5 122B-A10B (local)", "reasoning": true, "input": ["text", "image"], "contextWindow": 262144, "maxTokens": 65536, "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }, "compat": { "thinkingFormat": "qwen-chat-template" } }, 
submitted by /u/annodomini