Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

arXiv cs.AI / April 16, 2026


Key Points

  • The paper shows that large language models can produce fully correct step-by-step chain-of-thought reasoning while still outputting incorrect final answers, revealing a gap between “reasoning correctness” and “output correctness.”
  • It introduces the “Novel Operator Test” benchmark, which distinguishes operator logic from operator name by evaluating Boolean operator reasoning under unfamiliar naming conventions at multiple depths.
  • Experiments across five models (up to 8,100 problems each) demonstrate reasoning-output dissociation that existing benchmarks fail to detect, including cases like Claude Sonnet 4 where all observed errors had verifiably correct reasoning but wrong declared answers.
  • The study identifies two main failure modes: strategy failures at shallow depth (models over-rely on terse retrieval, which scaffolding largely fixes) and content failures at greater depth (models reason fully but err systematically, with errors eliminated after targeted intervention).
  • A “Trojan operator” experiment (relabeling XOR’s truth table with a novel name) indicates that the name alone does not determine reasoning correctness, while Llama shows a widening novelty gap as depth increases even though it handles the Trojan operator reliably.
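The Trojan-operator setup can be sketched as follows. This is a minimal illustration, not the paper's code: the operator name (`zorp`) and the problem format are invented here; only the core idea (XOR's truth table under an unfamiliar label, chained to a chosen depth) comes from the paper.

```python
import random

# Hypothetical "Trojan operator": XOR's truth table relabeled with an
# invented name, so the name is novel but the logic is familiar.
TROJAN_NAME = "zorp"  # invented label for illustration

def trojan(a: bool, b: bool) -> bool:
    """XOR's truth table under a novel name."""
    return a != b

def make_problem(depth: int, rng: random.Random):
    """Build a depth-`depth` chain like zorp(zorp(T, F), T) and its ground truth."""
    value = rng.choice([True, False])
    expr = "T" if value else "F"
    for _ in range(depth):
        operand = rng.choice([True, False])
        expr = f"{TROJAN_NAME}({expr}, {'T' if operand else 'F'})"
        value = trojan(value, operand)
    return expr, value

rng = random.Random(0)
expr, answer = make_problem(7, rng)
print(expr, "->", "T" if answer else "F")
```

Because the ground truth is computed mechanically, a model's declared answer can be graded independently of whatever name the operator carries, which is what lets the benchmark separate operator logic from operator name.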

Abstract

LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. For Claude Sonnet 4 at depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR's truth table under a novel name) confirms that name alone does not gate reasoning (p ≥ 0.49), while Llama's novelty gap widens to 28pp at depths 8-9 even as its Trojan accuracy stays at 92-100%, isolating genuine difficulty with novel logic from mere name unfamiliarity.
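The dissociation measurement described above can be illustrated with a small checker: grade the last value a model derives in its chain of thought separately from the answer it declares. The transcript format below is an assumption made for this sketch; the paper's actual parsing is not specified here.

```python
import re

def dissociation(transcript: str, ground_truth: str):
    """Return (reasoning_correct, answer_correct) for one transcript.

    Assumes a hypothetical transcript format with lines like
    'step 1: zorp(T, F) = T' and a closing 'Final answer: T'.
    """
    # Last intermediate value the model derived in its chain of thought.
    steps = re.findall(r"step \d+: .*?= ([TF])", transcript)
    # The answer the model actually declared.
    declared = re.search(r"Final answer: ([TF])", transcript)
    reasoning_correct = bool(steps) and steps[-1] == ground_truth
    answer_correct = declared is not None and declared.group(1) == ground_truth
    return reasoning_correct, answer_correct

example = "step 1: zorp(T, F) = T\nstep 2: zorp(T, T) = F\nFinal answer: T"
print(dissociation(example, "F"))  # -> (True, False): correct chain, wrong declared answer
```

An error counts as a reasoning-output dissociation exactly when the first flag is true and the second is false, which is the pattern reported for all 31 of Claude Sonnet 4's depth-7 errors.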