Calibrated Confidence Estimation for Tabular Question Answering

arXiv cs.CL / April 15, 2026


Key Points

  • The paper reports the first systematic comparison of five confidence estimation methods for tabular question answering, evaluated with five frontier LLMs on two benchmarks.
  • It finds that LLMs are consistently severely overconfident on structured tabular data (smooth ECE 0.35–0.64), unlike the much better-calibrated behavior reported for textual QA.
  • The results show a clear pattern: self-evaluation-based approaches underperform (AUROC 0.42–0.76) compared with perturbation-based methods (semantic entropy, self-consistency, and Multi-Format Agreement), which reach AUROC 0.78–0.86.
  • The proposed Multi-Format Agreement (MFA) leverages deterministic, lossless serialization differences (Markdown/HTML/JSON/CSV) to estimate confidence, reducing ECE by 44–63% while cutting API cost by ~20% versus sampling methods.
  • Structure-aware recalibration adds a further +10 percentage points of AUROC over standard post-hoc methods, and combining MFA with a self-consistency sampling ensemble boosts AUROC from 0.74 to 0.82.
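The MFA idea in the bullets above can be sketched as follows: serialize the same table losslessly into four formats, prompt the model once per format, and use the agreement fraction among the answers as the confidence score. This is a minimal illustration, not the paper's implementation; the `ask_llm` API call is omitted and replaced by example answers, and the exact agreement metric (majority fraction over normalized answers) is an assumption.

```python
import csv
import io
import json
from collections import Counter

def serialize_table(header, rows, fmt):
    """Deterministic, lossless serialization of one table into a given format."""
    if fmt == "markdown":
        lines = ["| " + " | ".join(header) + " |",
                 "| " + " | ".join("---" for _ in header) + " |"]
        lines += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
        return "\n".join(lines)
    if fmt == "html":
        head = "".join(f"<th>{h}</th>" for h in header)
        body = "".join(
            "<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"
    if fmt == "json":
        return json.dumps([dict(zip(header, r)) for r in rows])
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unknown format: {fmt}")

def mfa_confidence(answers):
    """Confidence = fraction of per-format answers that agree with the mode."""
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

# Hypothetical usage: in the real pipeline, each answer would come from one
# API call per serialization (markdown, html, json, csv) of the same table.
answers = ["42", "42", "42", "41"]
answer, confidence = mfa_confidence(answers)
print(answer, confidence)  # 42 0.75
```

Because the four serializations are deterministic, this needs exactly four calls, whereas sampling-based baselines typically draw five or more stochastic generations, which is consistent with the ~20% cost reduction the paper reports.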

Abstract

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35–0.64 versus 0.10–0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42–0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78–0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44–63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
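The headline overconfidence numbers rest on expected calibration error. The paper reports a smoothed (kernel-based) ECE; the sketch below uses the standard equal-width binned form, which illustrates the same quantity: the sample-weighted gap between stated confidence and empirical accuracy per bin. The toy inputs are illustrative, not the paper's data.

```python
import numpy as np

def binned_ece(confidences, correct, n_bins=10):
    """Binned expected calibration error.

    ECE = sum over bins of (bin weight) * |bin accuracy - bin mean confidence|.
    The paper uses a smooth (kernel) variant; this is the common binned form.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Half-open bins (lo, hi], with the first bin closed at 0.
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == lo
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# A severely overconfident model: it claims 95% confidence but is right 30%
# of the time, so nearly all the confidence mass is miscalibrated.
conf = [0.95] * 10
acc = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(round(binned_ece(conf, acc), 2))  # 0.65
```

A well-calibrated system would have bin accuracy tracking bin confidence, driving the per-bin gaps (and hence ECE) toward zero; the 0.35–0.64 range the paper reports for tabular QA sits far from that.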