I asked the OG question to Kimi K2.5:
"I want to wash my car and the car wash is just 10 metres away. Should I walk or drive there?"
Kimi K2.5 via NIM -- Three Modes
I tested three modes: no tools, XML pseudo-tools, and JSON schema tools. "Tools" here means web search plus Python in a Docker sandbox. I ran three tests in each mode.
| Mode | Correct (Drive) |
|---|---|
| No tools | 3/3 ✅ |
| XML pseudo-tools | 2/3 |
| JSON schema tools | 1/3 |
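The exact harness isn't shown in the post; here's a minimal sketch of how the three modes typically differ at the request level, assuming an OpenAI-compatible chat endpoint. The tool name and schema below are illustrative, not the actual setup:

```python
QUESTION = ("I want to wash my car and the car wash is just 10 metres away. "
            "Should I walk or drive there?")

# Illustrative tool definition -- name/parameters are assumptions.
WEB_SEARCH = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def build_request(mode: str) -> dict:
    """Build a chat-completions payload for one of the three test modes."""
    msgs = [{"role": "user", "content": QUESTION}]
    if mode == "no_tools":
        # Bare question, nothing else in context.
        return {"messages": msgs}
    if mode == "xml_pseudo":
        # Tools described as text in the system prompt; the model is expected
        # to emit <tool_call> XML, which the harness parses and executes.
        sys = ('You may call tools by emitting <tool_call name="web_search">'
               "<query>...</query></tool_call>.")
        return {"messages": [{"role": "system", "content": sys}] + msgs}
    if mode == "json_schema":
        # Tools passed via the API's native `tools` parameter.
        return {"messages": msgs, "tools": [WEB_SEARCH]}
    raise ValueError(mode)

for mode in ("no_tools", "xml_pseudo", "json_schema"):
    print(mode, "->", sorted(build_request(mode).keys()))
```

The point of the comparison: in the last two modes the context contains tool scaffolding before the model ever sees the question, which is the "overhead" being measured.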
Tool overhead seems to degrade intelligence.
Confirming with a Chemistry Question
To double-check, I ran one more test -- this time a niche chemistry question.
Background: diatomic molecules with even electron counts are generally diamagnetic, with two standard exceptions (the 10-electron and 16-electron systems). There's a lesser-known extension: the entire oxygen family (O₂, S₂, Se₂, Te₂...) is paramagnetic, not just O₂.
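The background rule can be made concrete with a quick MO-filling sketch. This assumes the O₂-style valence orbital ordering (σp below πp), which the oxygen-family diatomics share; the hard-coded level list is a simplification for this family, not a general MO calculator:

```python
# Valence MO levels for O2-family diatomics, in filling order: (name, capacity).
LEVELS = [("sigma_s", 2), ("sigma*_s", 2), ("sigma_p", 2),
          ("pi_p", 4), ("pi*_p", 4), ("sigma*_p", 2)]

def unpaired_electrons(valence_e: int) -> int:
    """Fill MOs aufbau-style, applying Hund's rule within each level."""
    unpaired = 0
    for _, cap in LEVELS:
        n = min(valence_e, cap)
        valence_e -= n
        # First half of a degenerate level fills singly, second half pairs up,
        # so unpaired electrons in the level = min(n, cap - n).
        unpaired += min(n, cap - n)
    return unpaired

# O2, S2, Se2, Te2: 6 valence electrons per atom -> 12 total.
print(unpaired_electrons(12))  # 2 unpaired in pi* -> paramagnetic
# F2: 14 valence electrons -> everything paired -> diamagnetic.
print(unpaired_electrons(14))  # 0
```

Two electrons left over for the degenerate π* pair is exactly why the whole family is paramagnetic despite an even electron count.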
I asked:
"I remember for finding whether a compound is para or diamagnetic we used the odd even electron rule, but there were 2 exceptions, 10 and 16 electrons. Are there any more exceptions?"
| Mode | Result |
|---|---|
| No tools | ✅ Correctly identified O₂ family -- S₂, Se₂, Te₂ all paramagnetic |
| XML pseudo-tools | ❌ Answered "no more exceptions to remember" |
| JSON schema tools | ❌ Same failure |
Conclusion
The model had the correct answer in both cases -- it just couldn't access it when tools were present. Tool schemas seem to push the model into "delegation mode": it looks for something to search or execute rather than reasoning from its own knowledge. No tools = full attention on the problem.
I also ran the car wash test on Qwen 3.5: success in no-tool mode, failure in tool mode.
Limitations
- Only tested Kimi K2.5 and Qwen 3.5
- 3 runs per mode is a small sample