Explanation Generation for Contradiction Reconciliation with LLMs

arXiv cs.CL / 3/25/2026


Key Points

  • The paper introduces a new task, “reconciliatory explanation generation,” where LLMs must produce explanations that make seemingly contradictory statements mutually compatible rather than treating contradictions as errors.
  • It proposes repurposing existing NLI datasets for this purpose and adds quality metrics to support scalable automatic evaluation.
  • Experiments across 18 LLMs show that most models only achieve limited success, revealing a largely under-explored capability gap in LLM reasoning for contradiction reconciliation.
  • The study finds that extending test-time compute via "thinking" helps, but its benefit plateaus as model size increases.
  • The authors argue the findings matter for downstream applications, such as chatbots and scientific aids, that rely on richer, explanation-based reasoning.
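The task setup described above, generating explanations that reconcile contradiction-labeled statement pairs, can be illustrated with a minimal sketch. This is purely hypothetical: the prompt wording, data rows, and field names are illustrative assumptions, not the paper's actual pipeline or datasets.

```python
# Hypothetical sketch: repurposing contradiction-labeled NLI pairs as
# inputs for reconciliatory explanation generation. The rows and the
# prompt template below are illustrative, not taken from the paper.

def build_reconciliation_prompt(premise: str, hypothesis: str) -> str:
    """Ask a model for an explanation that makes both statements compatible."""
    return (
        "The following two statements seem contradictory:\n"
        f"1. {premise}\n"
        f"2. {hypothesis}\n"
        "Write a plausible explanation under which both can be true."
    )

# Toy stand-in for an NLI dataset; keep only contradiction-labeled pairs.
nli_rows = [
    {"premise": "Cassie hates coffee.",
     "hypothesis": "She buys coffee every day.",
     "label": "contradiction"},
    {"premise": "The cafe is open.",
     "hypothesis": "The cafe serves coffee.",
     "label": "neutral"},
]

prompts = [
    build_reconciliation_prompt(r["premise"], r["hypothesis"])
    for r in nli_rows
    if r["label"] == "contradiction"
]
print(len(prompts))  # 1
```

Each resulting prompt would then be sent to an LLM, and the generated explanation scored with quality metrics such as those the paper proposes for scalable automatic evaluation.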

Abstract

Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee everyday" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs' downstream applications such as chatbots and scientific aids.