Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

arXiv cs.CL / 4/9/2026


Key Points

  • The paper studies three-way logical QA, where models must assign True/False/Unknown to a hypothesis given a premise set, and highlights failure modes including negation inconsistency and inappropriate epistemic Unknown predictions.
  • It introduces CGD-PD, a lightweight test-time decoding layer that queries a single 3-way classifier on both a hypothesis H and its mechanically negated form, then enforces negation-consistent outputs when possible.
  • For remaining Unknown cases, CGD-PD performs proof-driven disambiguation using targeted binary entailment probes to resolve the label more selectively rather than relying on raw uncertainty.
  • Evaluated on the FOLIO benchmark’s first-order-logic fields, the method improves accuracy consistently across frontier LLMs, with relative gains up to 16% and a reduction in Unknown predictions, using only about 4–5 model calls on average.

Abstract

Three-way logical question answering (QA) assigns True/False/Unknown to a hypothesis H given a premise set S. While modern large language models (LLMs) can be accurate on isolated examples, we identify two recurring failure modes in 3-way logic QA: (i) negation inconsistency, where answers to H and ¬H violate the deterministic label mapping, and (ii) epistemic Unknown, where the model predicts Unknown due to uncertainty or instability even when S entails one side. We present CGD-PD, a lightweight test-time layer that (a) queries a single 3-way classifier on both H and a mechanically negated form of H, (b) projects the pair onto a negation-consistent decision when possible, and (c) invokes a proof-driven disambiguation step that uses targeted binary entailment probes to selectively resolve Unknown outcomes, requiring only an average of 4–5 model calls. On the FOLIO benchmark's first-order-logic fields, CGD-PD yields consistent gains across frontier LLMs, with relative improvements in accuracy of up to 16% over the base model, while also reducing Unknown predictions.
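The decoding layer described in the abstract can be sketched as plain control flow. The following is a minimal illustration, not the paper's implementation: `classify3` (a 3-way classifier call) and `entails` (a binary entailment probe) are hypothetical callables standing in for model queries, and the tie-breaking rule for a committed/Unknown pair is an assumed policy the paper may define differently.

```python
def cgd_pd(classify3, entails, premises, h, neg_h):
    """Sketch of consistency-guided decoding with proof-driven
    disambiguation. `classify3(premises, hypothesis)` returns one of
    "True"/"False"/"Unknown"; `entails(premises, hypothesis)` returns bool.
    Both are hypothetical interfaces, not the paper's actual API."""
    # Step (a): query the 3-way classifier on H and its mechanical negation.
    label_h = classify3(premises, h)
    label_neg = classify3(premises, neg_h)

    # Deterministic label mapping under negation: True<->False, Unknown<->Unknown.
    flip = {"True": "False", "False": "True", "Unknown": "Unknown"}

    # Step (b): project onto a negation-consistent decision when possible.
    if label_h == flip[label_neg] and label_h != "Unknown":
        return label_h  # pair is already negation-consistent and committed
    if label_h != "Unknown" and label_neg == "Unknown":
        return label_h  # assumed policy: keep the committed side
    if label_neg != "Unknown" and label_h == "Unknown":
        return flip[label_neg]

    # Step (c): proof-driven disambiguation for remaining Unknown or
    # contradictory pairs, via targeted binary entailment probes.
    if entails(premises, h):
        return "True"
    if entails(premises, neg_h):
        return "False"
    return "Unknown"  # genuinely undetermined by the premises
```

With two base queries plus at most two entailment probes, the call budget lands in the 4–5 range the paper reports once retries or prompt variants are counted.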