DeltaLogic: Minimal Premise Edits Reveal Belief-Revision Failures in Logical Reasoning Models

arXiv cs.AI / 4/6/2026


Key Points

  • DeltaLogic is introduced as a benchmark protocol that tests belief revision under minimal premise edits by turning static reasoning problems into short “revision episodes.”
  • The method first elicits a conclusion from premises P, then applies a small edit δ(P), and finally checks whether the model’s prior conclusion should stay stable or be revised.
  • Experiments using FOLIO and ProofWriter show that stronger initial logical reasoning does not reliably translate to stronger revision behavior after local evidence changes (e.g., Qwen3-1.7B has higher initial accuracy than revision accuracy).
  • Some models exhibit notable “inertia” patterns and other failure modes such as near-universal abstention or control instability, indicating distinct weaknesses beyond fixed-premise inference.
  • The authors argue DeltaLogic measures a practically important capability—disciplined belief revision—that complements existing logical reasoning benchmarks.
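The two-turn episode structure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: all field and function names (`RevisionEpisode`, `run_episode`, the gold-label fields) are hypothetical, and the label vocabulary is assumed.

```python
from dataclasses import dataclass

# Hypothetical sketch of a DeltaLogic-style revision episode; the schema
# is inferred from the paper's description, not taken from its code.
@dataclass
class RevisionEpisode:
    premises: list[str]          # original premise set P
    edited_premises: list[str]   # P after the minimal edit delta(P)
    question: str                # conclusion to evaluate
    gold_initial: str            # gold label under P (e.g. "True"/"False"/"Unknown")
    gold_revised: str            # gold label under delta(P)

    @property
    def label_should_change(self) -> bool:
        # Episodes where the edit flips the gold label are the ones
        # used to measure "inertia" (failing to revise).
        return self.gold_initial != self.gold_revised

def run_episode(model_answer, episode: RevisionEpisode) -> dict:
    """Two-turn protocol: answer under P, then re-answer under delta(P)."""
    initial = model_answer(episode.premises, episode.question)
    revised = model_answer(episode.edited_premises, episode.question)
    return {
        "initial_correct": initial == episode.gold_initial,
        "revised_correct": revised == episode.gold_revised,
        # Inertia: the gold label changed but the model kept its old answer.
        "inertia": episode.label_should_change and revised == initial,
    }
```

A model that parrots its first answer regardless of the edit would score `inertia = True` on every label-changing episode, which is exactly the failure mode the benchmark isolates.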

Abstract

Reasoning benchmarks typically evaluate whether a model derives the correct answer from a fixed premise set, but they under-measure a closely related capability that matters in dynamic environments: belief revision under minimal evidence change. We introduce DeltaLogic, a benchmark transformation protocol that converts natural-language reasoning examples into short revision episodes. Each episode first asks for an initial conclusion under premises P, then applies a minimal edit δ(P), and finally asks whether the previous conclusion should remain stable or be revised. We instantiate DeltaLogic from FOLIO and ProofWriter and evaluate small causal language models with constrained label scoring. On a completed 30-episode Qwen evaluation subset, stronger initial reasoning still does not imply stronger revision behavior: Qwen3-1.7B reaches 0.667 initial accuracy but only 0.467 revision accuracy, with inertia rising to 0.600 on episodes where the gold label should change, while Qwen3-0.6B collapses into near-universal abstention. Qwen3-4B shows the same inertial failure pattern (0.650 initial, 0.450 revised, 0.600 inertia), whereas Phi-4-mini-instruct is substantially stronger (0.950 initial, 0.850 revised) but still exhibits non-trivial abstention and control instability. These results suggest that logical competence under fixed premises does not imply disciplined belief revision after local evidence edits. DeltaLogic therefore targets a distinct and practically important reasoning capability that complements existing logical inference and belief-updating benchmarks.
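The three headline numbers in the abstract (initial accuracy, revision accuracy, and inertia) can be aggregated from per-episode results as sketched below. The metric definitions here are inferred from the abstract, in particular that inertia is computed only over episodes whose gold label changes; the dictionary keys are illustrative, not the authors' schema.

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-episode outcomes into the three headline metrics.

    Each result dict is assumed to carry:
      initial_correct  - answer under P matched the gold label
      revised_correct  - answer under delta(P) matched the revised gold label
      label_changed    - the minimal edit flipped the gold label
      kept_old_answer  - the model repeated its first answer after the edit
    """
    n = len(results)
    changed = [r for r in results if r["label_changed"]]
    return {
        "initial_accuracy": sum(r["initial_correct"] for r in results) / n,
        "revision_accuracy": sum(r["revised_correct"] for r in results) / n,
        # Inertia is restricted to episodes where revision was required.
        "inertia": (sum(r["kept_old_answer"] for r in changed) / len(changed)
                    if changed else 0.0),
    }
```

Under this reading, a model like Qwen3-1.7B with 0.667 initial accuracy, 0.467 revision accuracy, and 0.600 inertia answers most fixed-premise questions correctly yet sticks with its stale answer on a majority of the episodes where the edit should have changed it.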