Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

arXiv cs.AI / 5/7/2026


Key Points

  • The paper argues that LLMs' brittle performance on complex temporal reasoning stems not primarily from deficits in autoregressive logical deduction, but from failures in mapping unstructured text to event representations.
  • It proposes a neuro-symbolic QA framework that converts raw text into explicit event graphs with interval constraints, separating semantic extraction from a symbolic reasoning engine (a minimal sketch of one such representation follows this list).
  • The method introduces a Probabilistic Inconsistency Signal (PIS) that combines symbolic credal intervals with neural epistemic uncertainty via Evidential Deep Learning on LLM hidden states to detect structural breaks.
  • Experiments show perfect 1.0 accuracy (4000/4000) with zero false positives/negatives on temporal arithmetic benchmarks when correct structural representations are given.
  • In noisy QA settings, the framework still reaches 75.1% accuracy and provides deterministic, step-level failure localization through explicit proof traces, reframing temporal QA as a structural alignment problem.

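The summary does not spell out the paper's event-graph formalism, so the following Python sketch is an illustrative stand-in: it treats an "event graph with interval constraints" as a Simple Temporal Network (STN) of difference constraints, where the standard consistency check also yields the kind of deterministic, step-level trace the paper attributes to its proof traces. The event names, constraint encoding, and `stn_consistent` helper are assumptions for illustration, not the authors' implementation.

```python
from itertools import product

# Hypothetical event graph: nodes are event time-points, edges are
# difference constraints "t_j - t_i <= w" (a Simple Temporal Network).
# This is an illustrative stand-in for the paper's "event graphs with
# interval constraints", whose exact representation is not public.

INF = float("inf")

def stn_consistent(nodes, constraints):
    """Floyd-Warshall closure over the constraint graph.

    constraints: dict mapping (i, j) -> w, meaning t_j - t_i <= w.
    Returns (consistent, trace); the trace records each tightening step,
    giving a deterministic, step-level account of any inconsistency.
    """
    d = {(i, j): (0 if i == j else constraints.get((i, j), INF))
         for i, j in product(nodes, repeat=2)}
    trace = []
    for k, i, j in product(nodes, repeat=3):
        relaxed = d[(i, k)] + d[(k, j)]
        if relaxed < d[(i, j)]:
            d[(i, j)] = relaxed
            trace.append(f"tighten {i} -> {j} via {k}: <= {relaxed}")
    # A negative self-loop means the constraints admit no valid timeline.
    consistent = all(d[(i, i)] >= 0 for i in nodes)
    return consistent, trace

# Toy story: "Alice left at least 5 minutes before Bob arrived;
# Bob arrived no later than Alice left" -> a structural contradiction.
nodes = ["alice_leave", "bob_arrive"]
constraints = {
    ("bob_arrive", "alice_leave"): -5,  # alice_leave - bob_arrive <= -5
    ("alice_leave", "bob_arrive"): 0,   # bob_arrive - alice_leave <= 0
}
ok, trace = stn_consistent(nodes, constraints)
print("consistent:", ok)
for step in trace:
    print(" ", step)
```

On a consistent graph the same routine answers temporal-arithmetic queries exactly, which is one way to read the paper's claim that the reasoning substrate is error-free once the representation is correct.
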
Abstract

Despite significant advances, large language models (LLMs) continue to exhibit brittle performance on complex temporal reasoning tasks. This failure mode is widely attributed to inherent deficits in autoregressive logical deduction. In this paper, we challenge this prevailing narrative, demonstrating that temporal reasoning is not the fundamental bottleneck; rather, the locus of failure lies in unstructured text-to-event representation. We introduce a novel neuro-symbolic question-answering framework governed by a Probabilistic Inconsistency Signal (PIS) that explicitly isolates perceptual errors from reasoning failures. By lifting unstructured text into explicit event graphs and interval constraints, our architecture strictly decouples semantic extraction from a symbolic reasoning engine. To robustly detect structural breaks, the PIS elegantly unifies symbolic credal intervals with epistemic neural uncertainty extracted via Evidential Deep Learning on LLM hidden states. Empirical evaluations reveal a striking paradigm shift: when provided with correct structural representations, our system's explicit proof traces achieve perfect 1.0 accuracy (4000/4000) and strictly zero false positives/negatives on temporal arithmetic benchmarks. On broader, noise-injected QA settings, the framework maintains a competitive 75.1% accuracy while enabling deterministic, step-level failure localization. Ultimately, by isolating the representation bottleneck from the reasoning substrate, this work reframes temporal QA from an algorithmic reasoning challenge to a structural alignment problem, charting a verifiable path forward for reliable neuro-symbolic AI.
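
The abstract describes the PIS only at a high level, so here is a hedged Python sketch of one plausible composition: the width of a symbolic credal interval over "the extracted event graph is consistent," fused with the standard Evidential Deep Learning vacuity u = K/S (Sensoy et al., 2018) computed from a Dirichlet whose evidence comes from an LLM-hidden-state head. The fusion rule, the weight `lam`, and the stand-in evidence logits are all assumptions for illustration, not the paper's published formula.

```python
import numpy as np

# Illustrative sketch (not the paper's code): a Probabilistic
# Inconsistency Signal fusing symbolic and neural uncertainty.

def edl_epistemic_uncertainty(logits: np.ndarray) -> float:
    """EDL vacuity u = K / S for a Dirichlet with alpha = evidence + 1."""
    evidence = np.log1p(np.exp(logits))  # softplus keeps evidence >= 0
    alpha = evidence + 1.0
    K, S = alpha.size, alpha.sum()
    return float(K / S)                  # 1.0 means total ignorance

def credal_width(p_low: float, p_high: float) -> float:
    """Width of the credal interval [p_low, p_high] that the symbolic
    reasoner assigns to the event graph being consistent."""
    return p_high - p_low

def pis(logits: np.ndarray, p_low: float, p_high: float, lam: float = 0.5) -> float:
    """Hypothetical fusion: a convex mix of symbolic ambiguity and neural
    epistemic uncertainty. A high PIS flags a likely structural break
    (bad extraction) rather than a reasoning failure."""
    return lam * credal_width(p_low, p_high) + (1 - lam) * edl_epistemic_uncertainty(logits)

# Example: vague input text -> wide credal interval + diffuse evidence.
logits = np.array([0.2, 0.1, 0.15])      # stand-in for an evidence head's output
print(f"PIS = {pis(logits, p_low=0.30, p_high=0.85):.3f}")
```

The convex mix is just one defensible design choice; the key property, consistent with the abstract, is that the signal rises when either the symbolic layer cannot pin down a probability or the neural extractor lacks evidence, which is exactly when a structural break in the representation is most likely.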