How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
arXiv cs.CL / 4/9/2026
Key Points
- The paper studies how exposing an LLM judge to a generator’s reasoning chains changes the judge’s ability to assess answer factuality across factual QA and math reasoning benchmarks.
- It finds that “weak” judges are often overly influenced by the presence of reasoning, tending to accept incorrect answers when accompanied by fluent-sounding explanations.
- “Strong” judges can use reasoning as partial evidence for correctness, but they are still frequently misled by reasoning chains that appear high-quality.
- Controlled experiments show that both the fluency and the factuality of the reasoning chain act as key signals driving judge decisions, which allows superficially fluent but incorrect reasoning to bias verdicts.
- The results suggest that robust LLM judges must be able to distinguish genuinely informative reasoning from superficial fluency when evaluating modern reasoning-capable models.
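The controlled comparison the paper describes can be sketched as follows. This is a hypothetical illustration, not the authors' code: the same (question, answer) pair is judged once without and once with the generator's reasoning chain, and we measure how often the verdict flips. The prompt wording, the `judge` callable, and the `flip_rate` metric are all assumptions made for this sketch.

```python
def build_judge_prompt(question, answer, reasoning=None):
    """Assemble a factuality-judging prompt; optionally include the
    generator's reasoning chain (hypothetical prompt format)."""
    parts = [f"Question: {question}", f"Answer: {answer}"]
    if reasoning is not None:
        parts.insert(1, f"Generator reasoning: {reasoning}")
    parts.append("Is the answer factually correct? Reply yes or no.")
    return "\n".join(parts)

def flip_rate(items, judge):
    """Fraction of (question, answer, reasoning) items whose verdict
    changes once the reasoning chain is shown to the judge.
    `judge` is any callable mapping a prompt string to True/False."""
    flips = 0
    for question, answer, reasoning in items:
        bare = judge(build_judge_prompt(question, answer))
        with_reasoning = judge(build_judge_prompt(question, answer, reasoning))
        flips += bare != with_reasoning
    return flips / len(items)
```

A high flip rate on items whose answers are known to be incorrect would correspond to the "overly influenced" behavior the paper attributes to weak judges.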