When Corrective Hints Hurt: Prompt Design in Reasoner-Guided Repair of LLM Overcaution on Entailed Negations under OWL 2 DL

arXiv cs.AI / 4/28/2026

Key Points

  • The study identifies a reproducible failure mode in GPT-5.4 for OWL 2 DL compliance queries, where the model often outputs “unknown” instead of “no” when the correct answer is reasoner-entailed under FunctionalProperty closure or class disjointness.
  • Using 180 reasoner-audited queries (plus 18 held-out queries across insurance and clinical domains), the researchers compare four prompting/interaction modes under matched query budgets.
  • A generic “you are wrong” retry substantially improves on direct (single-shot) faithfulness (43.9% to 81.7%), while reasoner-guided repair with an explicit open-world-assumption hint performs worse than the same repair without the hint (67.2% vs. 97.8%).
  • The “verdict-only” reasoner-guided repair achieves near-perfect faithfulness (97.8%), and the same error fingerprint explains all failures on the held-out set (4/4).
  • The authors conclude that prompt framing can outweigh the corrective content itself and recommend ablation testing for reasoner-guided wrappers rather than assuming that hints will help.

Abstract

We report a reproducible error pattern in GPT-5.4 on OWL 2 DL compliance queries: the model frequently answers “unknown” when the reasoner-entailed answer is “no” under FunctionalProperty closure or class disjointness. Using 180 reasoner-audited queries from a procedural expansion of the observed pattern plus 18 hand-authored held-out queries in two unrelated domains (insurance and clinical), we compare four interaction modes under a matched query budget: single-shot, three rounds of generic “you-are-wrong” retry, three rounds of reasoner-verdict repair with an open-world-assumption (OWA) hint, and the same repair without the hint. Direct faithfulness is 43.9% (Wilson 95% CI [36.8, 51.2]); generic retry reaches 81.7% ([75.4, 86.6]); the verdict-with-hint variant is worse at 67.2% ([60.1, 73.7]); the verdict-only variant reaches 97.8% ([94.4, 99.1]). All pairwise comparisons remain significant under McNemar's exact test with Bonferroni correction (α = 0.01; all p < 10⁻⁵). The same fingerprint accounts for 4/4 errors on the held-out queries. Our interpretation is bounded: prompt framing can matter more than corrective content, and reasoner-guided wrappers should be ablated explicitly.
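
The Wilson 95% intervals quoted in the abstract can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `wilson_interval` and the helper `as_percent` are illustrative, and the success counts (79, 147, 121, and 176 out of 180) are assumptions back-calculated from the reported percentages, since the abstract gives rates rather than raw counts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives 95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def as_percent(interval: tuple[float, float]) -> tuple[float, float]:
    """Convert a (low, high) proportion interval to percentages with one decimal."""
    low, high = interval
    return round(100 * low, 1), round(100 * high, 1)


N_QUERIES = 180

# ASSUMPTION: counts inferred from the reported rates; they are not in the source.
ASSUMED_COUNTS = {
    "single-shot": 79,          # 43.9%
    "generic retry": 147,       # 81.7%
    "verdict with hint": 121,   # 67.2%
    "verdict only": 176,        # 97.8%
}

if __name__ == "__main__":
    for mode, count in ASSUMED_COUNTS.items():
        rate = round(100 * count / N_QUERIES, 1)
        print(mode, rate, as_percent(wilson_interval(count, N_QUERIES)))
```

Under these assumed counts, the computed intervals agree with the reported ones to one decimal place, which suggests the percentages are simple proportions over all 180 queries.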