Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

arXiv cs.AI / April 16, 2026


Key Points

  • The paper shows that large language models can produce fully correct step-by-step chain-of-thought reasoning while still outputting incorrect final answers, revealing a gap between “reasoning correctness” and “output correctness.”
  • It introduces the “Novel Operator Test” benchmark, which distinguishes operator logic from operator name by evaluating Boolean operator reasoning under unfamiliar naming conventions at multiple depths.
  • Experiments across five models (up to 8,100 problems each) demonstrate reasoning-output dissociation that existing benchmarks fail to detect, including cases like Claude Sonnet 4 where all observed errors had verifiably correct reasoning but wrong declared answers.
  • The study identifies two main failure modes: strategy failures at shallow depth (models over-rely on terse retrieval, which scaffolding largely fixes) and content failures at greater depth (models reason fully but err systematically, with errors eliminated after targeted intervention).
  • A “Trojan operator” experiment (relabeling XOR’s truth table with a novel name) indicates that the name alone does not determine reasoning correctness, while Llama shows a widening novelty gap as depth increases even though it handles the Trojan operator reliably.
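The Trojan-operator setup can be sketched as follows. This is a minimal illustration, not the paper's code: the operator name (`zorp`) and the problem format are invented here; only the core idea (XOR's truth table under an unfamiliar label, chained to a chosen depth) comes from the paper.

```python
import random

# Hypothetical "Trojan operator": XOR's truth table relabeled with an
# invented name, so the name is novel but the logic is familiar.
TROJAN_NAME = "zorp"  # invented label for illustration

def trojan(a: bool, b: bool) -> bool:
    """XOR's truth table under a novel name."""
    return a != b

def make_problem(depth: int, rng: random.Random):
    """Build a depth-`depth` chain like zorp(zorp(T, F), T) and its ground truth."""
    value = rng.choice([True, False])
    expr = "T" if value else "F"
    for _ in range(depth):
        operand = rng.choice([True, False])
        expr = f"{TROJAN_NAME}({expr}, {'T' if operand else 'F'})"
        value = trojan(value, operand)
    return expr, value

rng = random.Random(0)
expr, answer = make_problem(7, rng)
print(expr, "->", "T" if answer else "F")
```

Because the ground truth is computed mechanically, a model's declared answer can be graded independently of whatever name the operator carries, which is what lets the benchmark separate operator logic from operator name.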

Abstract

LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. For Claude Sonnet 4 at depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR's truth table under a novel name) confirms that name alone does not gate reasoning (p ≥ 0.49), while Llama's novelty gap widens to 28pp at depths 8-9 even as its Trojan accuracy stays at 92-100%, isolating genuine difficulty with novel logic from mere name unfamiliarity.
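The dissociation measurement described above can be illustrated with a small checker: grade the last value a model derives in its chain of thought separately from the answer it declares. The transcript format below is an assumption made for this sketch; the paper's actual parsing is not specified here.

```python
import re

def dissociation(transcript: str, ground_truth: str):
    """Return (reasoning_correct, answer_correct) for one transcript.

    Assumes a hypothetical transcript format with lines like
    'step 1: zorp(T, F) = T' and a closing 'Final answer: T'.
    """
    # Last intermediate value the model derived in its chain of thought.
    steps = re.findall(r"step \d+: .*?= ([TF])", transcript)
    # The answer the model actually declared.
    declared = re.search(r"Final answer: ([TF])", transcript)
    reasoning_correct = bool(steps) and steps[-1] == ground_truth
    answer_correct = declared is not None and declared.group(1) == ground_truth
    return reasoning_correct, answer_correct

example = "step 1: zorp(T, F) = T\nstep 2: zorp(T, T) = F\nFinal answer: T"
print(dissociation(example, "F"))  # -> (True, False): correct chain, wrong declared answer
```

An error counts as a reasoning-output dissociation exactly when the first flag is true and the second is false, which is the pattern reported for all 31 of Claude Sonnet 4's depth-7 errors.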