How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality

arXiv cs.CL / 4/9/2026


Key Points

  • The paper studies how exposing an LLM judge to a generator’s reasoning chains changes the judge’s ability to assess answer factuality across factual QA and math reasoning benchmarks (a minimal sketch of this setup follows the list).
  • It finds that “weak” judges are often overly influenced by the presence of reasoning, tending to accept incorrect answers when accompanied by fluent-sounding explanations.
  • “Strong” judges can use reasoning as partial evidence for correctness, but they are still frequently misled by reasoning chains that appear high-quality.
  • Controlled experiments show that both the fluency and the factuality of the reasoning chain act as key signals driving judge decisions, so superficially fluent but flawed reasoning can sway verdicts.
  • The results suggest that robust LLM judges must be able to distinguish genuinely informative reasoning from superficial fluency when evaluating modern reasoning-capable models.
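
The core manipulation is easy to picture as code. Below is a minimal sketch, assuming an OpenAI-compatible chat API; the prompt wording, judge model name, and helper function are illustrative assumptions, not the paper's exact setup. The judge sees either the answer alone or the answer plus the generator's reasoning chain, so any difference in verdicts on the same item can be attributed to reasoning exposure.

```python
# Sketch of the two judging conditions: answer-only vs. answer + reasoning chain.
# Prompt text, model name, and the judge() helper are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge(question: str, answer: str, reasoning: str | None = None) -> str:
    """Ask the judge model whether `answer` is factually correct."""
    parts = [f"Question: {question}", f"Candidate answer: {answer}"]
    if reasoning is not None:
        # Reasoning-exposed condition: include the generator's chain of thought.
        parts.append(f"Generator's reasoning: {reasoning}")
    parts.append("Is the candidate answer factually correct? Reply CORRECT or INCORRECT.")

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": "\n".join(parts)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


# Judging the same (question, answer) pair under both conditions isolates
# the effect of exposing the reasoning chain to the judge.
verdict_answer_only = judge("What is 17 * 24?", "408")
verdict_with_chain = judge("What is 17 * 24?", "408",
                           reasoning="17*24 = 17*20 + 17*4 = 340 + 68 = 408.")
```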

Abstract

Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information when assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by the mere presence of reasoning, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.
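
To make the controlled comparison of fluency and factuality concrete, here is a hedged sketch of one way such perturbations could be constructed; the perturbation functions and acceptance metric below are illustrative assumptions, not the authors' protocol. The answer is held fixed while the reasoning chain is either made less fluent or given a factually wrong step, and judge acceptance rates are compared across variants.

```python
# Illustrative perturbations of a reasoning chain with the answer held fixed.
# These helpers are assumptions for exposition, not the paper's procedure.
import random


def degrade_fluency(reasoning: str, drop_prob: float = 0.2, seed: int = 0) -> str:
    """Fluency perturbation: randomly drop words while keeping the content roughly intact."""
    rng = random.Random(seed)
    words = reasoning.split()
    return " ".join(w for w in words if rng.random() > drop_prob)


def inject_factual_error(reasoning: str, wrong_step: str) -> str:
    """Factuality perturbation: replace the final step of the chain with an incorrect one."""
    sentences = reasoning.rstrip(".").split(". ")
    return ". ".join(sentences[:-1] + [wrong_step]) + "."


def acceptance_rate(verdicts: list[str]) -> float:
    """Fraction of judge verdicts that accept the answer as correct."""
    return sum(v.strip().upper().startswith("CORRECT") for v in verdicts) / len(verdicts)
```

Comparing acceptance rates across the original, fluency-degraded, and error-injected chains (for the same fixed answer) is one way to separate how much the judge reacts to how the reasoning sounds versus whether it is actually correct.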