E-Scores for (In)Correctness Assessment of Generative Model Outputs

arXiv stat.ML / 4/2/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that, despite the ubiquity of generative models (especially LLMs), principled mechanisms for assessing the correctness of their outputs remain limited.
  • It critiques prior conformal-prediction approaches based on p-values: choosing the tolerance level post-hoc amounts to p-hacking and invalidates their theoretical guarantees.
  • The authors instead instantiate the conformal prediction framework with e-values, producing e-scores that quantify incorrectness while retaining the same error guarantees.
  • e-scores additionally let users choose tolerance levels in a data-dependent way, with an upper bound on size distortion, a post-hoc notion of error.
  • Experiments demonstrate the method under different notions of correctness, including mathematical factuality and satisfaction of property constraints.
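
To make the p-value vs. e-value contrast concrete, here is a minimal sketch of one standard conformal e-value construction (a Vovk-style ratio for non-negative nonconformity scores). This is an illustrative assumption, not necessarily the exact scoring function used in the paper; `conformal_e_score` and `conformal_p_score` are hypothetical helper names.

```python
import random

def conformal_e_score(cal_scores, test_score):
    """Vovk-style conformal e-value for non-negative nonconformity
    scores: e = (n+1) * s_test / (sum of all n+1 scores). Under
    exchangeability E[e] <= 1, so by Markov's inequality
    P(e >= 1/alpha) <= alpha for any fixed tolerance alpha."""
    n = len(cal_scores)
    total = sum(cal_scores) + test_score
    if total == 0:
        return 1.0  # all scores zero: no evidence of incorrectness
    return (n + 1) * test_score / total

def conformal_p_score(cal_scores, test_score):
    """Classic conformal p-value, for comparison: its guarantee only
    holds for a tolerance level fixed before looking at the data."""
    n = len(cal_scores)
    ge = sum(1 for s in cal_scores if s >= test_score)
    return (1 + ge) / (n + 1)

# Hypothetical illustration: calibration scores from responses known
# to be correct, plus one test response with an unusually large score.
random.seed(0)
cal = [random.random() for _ in range(99)]
e = conformal_e_score(cal, test_score=5.0)
p = conformal_p_score(cal, test_score=5.0)
# A large e-score (e >> 1) is evidence of incorrectness; flagging the
# output whenever e >= 1/alpha keeps the error rate below alpha.
```

The e-score is an evidence measure on [0, ∞): larger means stronger evidence of incorrectness, and Markov's inequality converts it into the same kind of error guarantee that the p-value route provides.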

Abstract

While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as measures of incorrectness. In addition to achieving the guarantees as before, e-scores further provide users with the flexibility of choosing data-dependent tolerance levels while upper bounding size distortion, a post-hoc notion of error. We experimentally demonstrate their efficacy in assessing LLM outputs under different forms of correctness: mathematical factuality and property constraints satisfaction.
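
The abstract's claim that e-scores bound size distortion under data-dependent tolerance levels follows from a pointwise inequality: 1{e ≥ 1/α}/α ≤ e for any α, even one chosen after seeing e. The Monte Carlo sketch below checks this empirically under an exchangeable null, using the same generic Vovk-style e-value construction as an assumption (not necessarily the paper's exact scores).

```python
import random

def conformal_e_score(cal_scores, test_score):
    """Generic conformal e-value (Vovk-style ratio); an illustrative
    assumption, not necessarily the paper's exact construction."""
    n = len(cal_scores)
    total = sum(cal_scores) + test_score
    return (n + 1) * test_score / total if total > 0 else 1.0

# Post-hoc guarantee check: even when alpha is chosen adversarially
# AFTER seeing the e-score, the size distortion 1{e >= 1/alpha}/alpha
# is at most e pointwise, so its average is bounded by E[e] <= 1
# under exchangeability (i.e., when the test output is "correct").
random.seed(1)
distortions = []
for _ in range(20_000):
    scores = [random.expovariate(1.0) for _ in range(21)]
    cal, test = scores[:-1], scores[-1]   # exchangeable: null holds
    e = conformal_e_score(cal, test)
    alpha = min(1.0, 1.0 / e)             # worst-case data-dependent level
    reject = e >= 1.0 / alpha
    distortions.append(1.0 / alpha if reject else 0.0)

avg = sum(distortions) / len(distortions)
# avg stays below 1 despite the p-hacking-style choice of alpha
```

The same adversarial scheme applied to p-values would inflate the realized error well beyond the nominal level, which is exactly the failure mode the paper attributes to p-value-based conformal approaches.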