Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverage Guarantees

arXiv cs.CL / 3/25/2026


Key Points

  • The paper argues that conventional single-response (point) outputs from large language models can underestimate performance, and proposes switching to set-valued prediction that returns a set of candidate answers.
  • It introduces a feasibility-aware framework for coverage guarantees, showing that due to the finite and sampling-based nature of LLM generation, achieving coverage is not always possible for every question.
  • The work defines a minimum achievable risk level (MRL): for risk targets below it, statistical coverage guarantees cannot be met, even with repeated sampling.
  • It presents a data-driven calibration method that uses sampled responses to estimate a threshold, enabling prediction sets that include a correct answer with the desired probability whenever the risk target is feasible.
  • Experiments across six generation tasks and five LLMs indicate the approach is both statistically valid and efficient in producing reliable prediction sets.
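The feasibility idea behind the MRL can be illustrated with a small sketch. The function name and data layout below are assumptions for illustration, not the paper's definitions: if a calibration question's sampled candidate pool contains no acceptable answer, no prediction set built from that pool can cover it, so the fraction of such questions lower-bounds the achievable risk.

```python
def estimate_min_risk(candidate_correct):
    """Estimate a minimum achievable risk level from calibration data.

    candidate_correct: one list of booleans per calibration question,
    each flag marking whether a sampled candidate was judged acceptable.
    Questions whose sampled pool contains no acceptable answer can never
    be covered, so their fraction lower-bounds the risk any set-valued
    predictor built from these samples can attain.
    """
    misses = sum(1 for flags in candidate_correct if not any(flags))
    return misses / len(candidate_correct)


# Example: 1 of 4 calibration questions has no correct sampled candidate,
# so no risk target below 0.25 is feasible for this pool.
pools = [[True, False], [False, False], [False, True], [True, True]]
print(estimate_min_risk(pools))  # 0.25
```

Any user-specified risk target below this estimate is infeasible for the given sampling budget; the paper's framework detects this case rather than silently reporting an invalid guarantee.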

Abstract

Large language models (LLMs) inherently operate over a large generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model's capability: although the top-ranked response can be incorrect, valid answers may still exist within the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction, which provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samplings, LLMs may fail to yield an acceptable response for certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we then develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with a desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.
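The calibration procedure the abstract describes is in the spirit of split conformal prediction: estimate a threshold on held-out data, then include every sampled candidate that clears it. The sketch below is an assumption-laden illustration of that general recipe, not the paper's exact algorithm; the score function, the choice of order statistic, and the function names are all hypothetical.

```python
import math


def calibrate_threshold(cal_scores, alpha):
    """Split-conformal-style threshold estimation.

    cal_scores[i] is the score of the best acceptable candidate sampled
    for calibration question i (use float('-inf') when no sampled
    candidate is acceptable). Under exchangeability, including every
    candidate whose score is >= the returned threshold yields coverage
    of at least 1 - alpha, provided the target is feasible.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conservative order statistic
    if k > n:
        raise ValueError("risk target infeasible for this calibration set")
    return sorted(cal_scores, reverse=True)[k - 1]


def prediction_set(candidates, scores, tau):
    """Return the sampled candidates whose score clears the threshold."""
    return [c for c, s in zip(candidates, scores) if s >= tau]


# Calibrate on 9 held-out questions at a 20% risk target, then build a set.
tau = calibrate_threshold([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1], alpha=0.2)
print(prediction_set(["a", "b", "c"], [0.5, 0.1, 0.3], tau))  # ['a', 'c']
```

The feasibility check surfaces naturally here: when too many calibration questions have no acceptable sampled candidate (score of negative infinity), the required order statistic falls on an infeasible entry, mirroring the paper's point that targets below the minimum achievable risk level cannot be certified.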