Car Wash Mystery solved--Tool Call Degrades Intelligence.

Reddit r/LocalLLaMA / 4/27/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The author reports experiments with Kimi-k2.5 showing that using tool-calling (with web search + Python in a Docker sandbox) reduces accuracy on a simple “car wash” decision question compared with no-tools prompting.
  • Three modes were tested (no tools, XML pseudo-tools, and JSON schema tools), and correct answers dropped progressively when tools were enabled (3/3, 2/3, and 1/3 respectively).
  • A follow-up chemistry question produced the same pattern: the model knew the answer in no-tools mode but failed when tool schemas were present, suggesting the model shifts into a “delegation mode” rather than reasoning from internal knowledge.
  • The conclusion is that tool schema overhead and the presence of tools can degrade intelligence for some tasks, and similar behavior was observed when testing with Qwen 3.5.
  • Limitations include testing only two model variants and a small sample size (three runs per mode), so results may not generalize broadly.

I asked the OG question to the kimi k2.5:

"I want to wash my car and the car wash is just 10 metres away. Should I walk or drive there?"

Kimi-k2.5 via NIM -- Three Modes.

I tested three modes: no tools, XML pseudo-tools, and JSON schema tools. "Tools" here means web search + Python in a Docker sandbox. 3 tests were conducted in each mode.

Mode Correct (Drive)
No tools 3/3 ✅
XML pseudo-tools 2/3
JSON schema tools 1/3

tool overhead seems to degrade intelligence

Confirming with a Chemistry Question

To double check, I ran one more test --this time a niche chemistry question.

Background: diatomic molecules with even electron counts are generally diamagnetic, with two standard exceptions (10e and 16e systems). There's a lesser-known extension-- the entire oxygen family (O₂, S₂, Se₂, Te₂...) are all paramagnetic, not just O₂.

I asked:

"I remember for finding whether a compound is para or diamagnetic we used the odd even electron rule, but there were 2 exceptions, 10 and 16 electrons. Are there any more exceptions?"

Mode Result
No tools ✅ Correctly identified O₂ family -- S₂, Se₂, Te₂ all paramagnetic
XML pseudo-tools answered- "No more exceptions to remember" , this is failure ofc.
JSON schema tools Similar failure

Conclusion

The model had the correct answer in both cases --it just couldn't access it when tools were present. Tool schemas seem to push the model into "delegation mode" where it looks for something to search or execute rather than reasoning from its own knowledge. No tools = full attention on the problem.

i tested car wash test with qwen 3.5 also and found success in no tool mode and failure in tool mode.

Limitations

  • Only tested on Kimi-k2.5, qwen 3.5
  • 3 runs per mode is a small sample
submitted by /u/Spirited_Neck1858
[link] [comments]