Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

arXiv cs.CL · April 16, 2026


Key Points

  • The paper argues that LLM reasoning traces can fail in two distinct ways: flawed content within steps (e.g., logical errors or hallucinations) and flawed step behavior (e.g., overthinking or underthinking), with the issues varying across samples.
  • It reports that simply providing ground-truth labels to guide reasoning does not improve overall reasoning ability, contradicting a common intuition.
  • To address both step-internal and step-wise flaws, it introduces CRAFT, which constructs a Reasoning Knowledge Graph (RKG) from the consensus portions of multiple candidate traces.
  • CRAFT then synthesizes a final reasoning trace using topological generation over the RKG, aiming to produce more robust and reliable step sequences.
  • Experiments show 10%+ gains in label-prediction accuracy on average and consistent improvements over baselines on logical and mathematical reasoning benchmarks, with added evidence that trace quality improves across multiple evaluation dimensions.
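The consensus-then-linearize pipeline the key points describe can be sketched in a few lines: keep only step-to-step transitions that recur across candidate traces, then emit the surviving steps in topological order. This is a toy illustration under assumptions of ours, not the paper's implementation; the step strings, the `min_support` threshold, and the helper names are all hypothetical.

```python
from collections import Counter, defaultdict
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def build_consensus_rkg(traces, min_support=2):
    """Build a toy 'Reasoning Knowledge Graph': keep only step-to-step
    edges that appear in at least `min_support` candidate traces."""
    edge_counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            edge_counts[(a, b)] += 1
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = defaultdict(set)
    for (a, b), n in edge_counts.items():
        if n >= min_support:
            graph[b].add(a)
    return graph

def synthesize_trace(graph):
    """Linearize the consensus graph into a single ordered trace."""
    return list(TopologicalSorter(graph).static_order())

# Three candidate traces for the same problem; only the transitions
# shared by at least two of them survive into the final trace.
candidates = [
    ["parse question", "set up equation", "solve equation", "check answer"],
    ["parse question", "set up equation", "solve equation", "state result"],
    ["parse question", "restate facts", "set up equation", "solve equation"],
]
print(synthesize_trace(build_consensus_rkg(candidates)))
# → ['parse question', 'set up equation', 'solve equation']
```

The idiosyncratic steps ("check answer", "restate facts") each occur in only one trace, so their edges fall below the consensus threshold and are pruned, which loosely mirrors how majority agreement could filter out per-sample step flaws before the final trace is generated.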

Abstract

LLM reasoning traces suffer from complex flaws: *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs' reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of step flaws: it builds a Reasoning Knowledge Graph (RKG) from the consensus parts of multiple candidate traces and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by over 10% on average and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation shows that our method also improves the quality of LLMs' reasoning traces along multiple dimensions.