Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

arXiv cs.CL / April 22, 2026


Key Points

  • The paper investigates uncertainty quantification for black-box LLM queries where only a small number of responses can be sampled, making accurate risk estimation difficult.
  • It uses the “effective semantic alphabet size” (the number of distinct meanings in sampled responses) as a proxy for downstream hallucination risk.
  • Frequency-only estimators can miss rare semantic modes with small samples, and graph-spectral measures alone cannot reliably estimate semantic occupancy.
  • The authors propose SHADE, which fuses Generalized Good-Turing coverage with a heat-kernel trace from an entailment-weighted graph over sampled responses, using adaptive fusion rules based on estimated coverage.
  • Experiments on semantic alphabet-size estimation and QA incorrectness detection show SHADE provides the largest gains in the most sample-limited settings, with improvements diminishing as sample counts grow.
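The paper does not publish code, but the frequency-based half of SHADE rests on a classical idea: the Good-Turing coverage estimate, where the fraction of semantic clusters seen exactly once among the sampled responses signals how much semantic mass remains unobserved. A minimal sketch (the clustering of responses into meaning labels is assumed to happen upstream, e.g. via pairwise entailment):

```python
from collections import Counter

def good_turing_coverage(labels):
    """Good-Turing coverage estimate for a list of semantic cluster labels.

    C_hat = 1 - f1 / n, where n is the number of sampled responses and
    f1 is the number of clusters observed exactly once. A low C_hat
    suggests rare semantic modes are still missing from the sample.
    """
    counts = Counter(labels)
    n = len(labels)
    f1 = sum(1 for c in counts.values() if c == 1)
    return 1.0 - f1 / n

# Five responses mapping to clusters A, A, A, B, C: two singletons,
# so estimated coverage is 1 - 2/5 = 0.6.
```

With many repeated meanings the coverage approaches 1 and a frequency count of distinct clusters is trustworthy; with many singletons it drops, which is exactly the regime where the paper argues spectral information must be brought in.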

Abstract

This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size (that is, the number of distinct meanings expressed in the sampled responses) provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
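To make the spectral half and the adaptive fusion concrete, the following sketch computes a heat-kernel trace of the symmetric normalized Laplacian of an entailment-weighted graph (a soft count of semantic modes) and fuses it with a frequency-based count. The diffusion time `t`, temperature `beta`, mixing weight `alpha`, and coverage threshold are illustrative assumptions, not the paper's tuned values, and the exact LogSumExp form is our reading of the abstract:

```python
import numpy as np

def heat_kernel_trace(W, t=1.0):
    """Trace of exp(-t * L_sym), where L_sym = I - D^{-1/2} W D^{-1/2}
    is the symmetric normalized Laplacian of weight matrix W.

    For large t the trace tends toward the number of connected
    components, acting as a soft count of semantic modes in the graph.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)  # L_sym is symmetric PSD
    return float(np.exp(-t * eigvals).sum())

def shade_fuse(k_freq, k_spec, coverage, thresh=0.9, alpha=0.5, beta=2.0):
    """Coverage-adaptive fusion of a frequency-based count (k_freq) and
    a spectral soft count (k_spec), following the rule described in the
    abstract; parameter values here are assumptions.

    High coverage: a convex combination of the two signals.
    Low coverage: a LogSumExp that leans toward the larger estimate,
    emphasizing possibly under-observed semantic modes.
    """
    if coverage >= thresh:
        return alpha * k_freq + (1.0 - alpha) * k_spec
    return float(np.logaddexp(beta * k_freq, beta * k_spec)) / beta

# Two disconnected response pairs (two semantic modes): at large t the
# heat-kernel trace settles near 2, matching the number of components.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
```

Under high coverage the fusion behaves like a plain average; under low coverage the LogSumExp output sits slightly above the larger of the two counts, which is the intended pessimistic bias when rare modes may be missing.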