Calibrated Confidence Estimation for Tabular Question Answering
arXiv cs.CL / April 15, 2026
Key Points
- The paper reports the first systematic comparison of five confidence estimation methods for tabular question answering using five frontier LLMs across two tabular QA benchmarks.
- It finds that LLMs are consistently and severely overconfident on structured tabular data (smooth ECE 0.35–0.64), in contrast to the much better-calibrated behavior reported for textual QA.
- The results show a clear pattern: self-evaluation-based approaches underperform (AUROC 0.42–0.76) compared with perturbation-based methods (semantic entropy, self-consistency, and Multi-Format Agreement) which reach AUROC 0.78–0.86.
- The proposed Multi-Format Agreement (MFA) leverages deterministic, lossless serialization differences (Markdown/HTML/JSON/CSV) to estimate confidence, reducing ECE by 44–63% while cutting API cost by ~20% versus sampling methods.
- Structure-aware recalibration further improves performance, and combining MFA with sampling ensembles boosts AUROC from 0.74 to 0.82.
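The Multi-Format Agreement idea described above — serialize the same table in several lossless formats, query the model once per format, and treat cross-format agreement as a confidence signal — can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the serializers, the `mfa_confidence` scoring rule (majority-vote fraction), and the hypothetical per-format answers are all assumptions.

```python
from collections import Counter
import csv, io, json

# Minimal, deterministic serializers for a small table (illustrative only).
def to_markdown(headers, rows):
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
    return "\n".join(lines)

def to_json(headers, rows):
    return json.dumps([dict(zip(headers, r)) for r in rows])

def to_csv(headers, rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue()

def mfa_confidence(answers):
    """Score = fraction of serialization formats whose answer matches the majority.

    `answers` is a list of model answers, one per table format.
    Returns (majority_answer, agreement_fraction in [0, 1]).
    """
    (majority, count), = Counter(answers).most_common(1)
    return majority, count / len(answers)

# Hypothetical answers the same LLM might return when shown each serialization
# of one table (no API calls made here; values are made up for illustration).
answers_by_format = {"markdown": "42", "html": "42", "json": "42", "csv": "41"}
answer, confidence = mfa_confidence(list(answers_by_format.values()))
# Majority answer "42", agreement 3/4 = 0.75
```

Because each format is a single deterministic query rather than repeated temperature sampling, this style of agreement check needs one call per format, which is consistent with the article's claim that MFA is cheaper than sampling-based consistency methods.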