Post-Selection Distributional Model Evaluation

arXiv stat.ML / 3/25/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that standard model evaluation focuses on meeting a known KPI, but many real scenarios require comparing models across the full performance–reliability trade-off spectrum without knowing the target KPI in advance.
  • It introduces post-selection distributional model evaluation (PS-DME), a framework to estimate test-time KPI distributions after arbitrary, data-dependent pre-selection of candidate models.
  • PS-DME addresses post-selection bias by using e-values to control the post-selection false coverage rate (FCR) for distributional KPI estimates.
  • The authors prove PS-DME is more sample efficient than a baseline that relies on sample splitting.
  • Experiments (including text-to-SQL with large language models and telecom network evaluation) show PS-DME enables statistically reliable model/configuration comparisons across multiple reliability levels.
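The sample-splitting baseline that PS-DME is proved more sample efficient than can be sketched in a few lines: half the data is spent picking the best-looking model, and only the held-out half is used to bound that model's KPI distribution. The sketch below is a toy illustration under assumed conventions (a `scores` dict of per-example KPI values, lower is better, and a Dvoretzky–Kiefer–Wolfowitz band as the distributional estimate); it is not the paper's code.

```python
import math
import random

def split_select_and_band(scores, alpha=0.1, seed=0):
    """Toy sample-splitting baseline: select on one half of the data,
    estimate the KPI CDF on the other half. `scores` maps a model name
    to its list of per-example KPI values (hypothetical setup; lower
    KPI = better)."""
    n = len(next(iter(scores.values())))
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    sel, est = idx[: n // 2], idx[n // 2:]
    # Step 1: data-dependent pre-selection, on the first half only.
    best = min(scores, key=lambda m: sum(scores[m][i] for i in sel) / len(sel))
    # Step 2: bound the selected model's KPI CDF on the held-out half,
    # so selection cannot bias the band -- but half the sample is
    # "wasted" on selection, which is the inefficiency PS-DME avoids.
    held = sorted(scores[best][i] for i in est)
    m = len(held)
    eps = math.sqrt(math.log(2 / alpha) / (2 * m))  # DKW band half-width
    band = [(x, max(k / m - eps, 0.0), min(k / m + eps, 1.0))
            for k, x in enumerate(held, start=1)]
    return best, band
```

With probability at least 1 − alpha, the true KPI CDF of the selected model lies between the lower and upper envelopes at every held-out point, which is the kind of full-distribution statement (rather than a single-KPI certificate) that the paper is after.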

Abstract

Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability that the models can achieve at test time. This task, which requires reliable estimation of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls the post-selection false coverage rate (FCR) for the distributional KPI estimates and is proved to be more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance–reliability trade-offs.
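The e-value machinery the abstract leans on has a simple core: an e-variable has expectation at most 1 under the null hypothesis, so Markov's inequality bounds P(E ≥ 1/α) by α, and inverting the e-value over candidate nulls yields a valid confidence set. The toy Bernoulli sketch below illustrates only this building block, not the paper's construction or its FCR adjustment; the function names and the grid-inversion setup are assumptions for illustration.

```python
def lr_evalue(xs, p0, p1):
    """Likelihood-ratio e-value for the point null H0: P(success) = p0,
    against a fixed alternative p1 chosen before seeing the data.
    Under H0 its expectation is exactly 1, so Markov's inequality gives
    P(E >= 1/alpha) <= alpha -- the validity hook that e-value-based
    post-selection methods exploit."""
    e = 1.0
    for x in xs:
        e *= (p1 / p0) if x else ((1 - p1) / (1 - p0))
    return e

def e_confidence_set(xs, p1, alpha=0.1, grid=None):
    """Invert the e-value into a confidence set for the success rate:
    keep every candidate p0 whose e-value stays below 1/alpha."""
    grid = grid or [i / 200 for i in range(1, 200)]
    return [p0 for p0 in grid if lr_evalue(xs, p0, p1) < 1 / alpha]
```

For instance, on 70 successes out of 100 trials with alternative `p1=0.7` and `alpha=0.1`, the resulting set is an interval of success rates around 0.7. Because the Markov bound holds regardless of how a hypothesis came to be examined, e-values are a natural starting point for coverage guarantees that survive data-dependent pre-selection.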