Submodular Benchmark Selection

arXiv cs.AI / May 5, 2026


Key Points

  • The paper addresses the high cost of evaluating large language models across many correlated benchmarks by framing benchmark subset selection as submodular maximization under a multivariate Gaussian assumption.
  • It derives two natural objectives: the Gaussian entropy of the selected subset (a log-determinant of its covariance) and the mutual information between selected and remaining benchmarks. Both are submodular, and greedy entropy selection coincides with pivoted Cholesky, inheriting spectral bounds on the residual.
  • It finds that mutual information is generally non-monotone, but is empirically monotone for small subsets, enabling greedy optimization.
  • Experiments using three matrices derived from ten public leaderboards indicate that mutual-information-based selection performs better than entropy-based selection for imputation when selecting small subsets.
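The entropy objective above admits a simple greedy algorithm: at each step, select the benchmark with the largest conditional (residual) variance given those already chosen, which is exactly the pivoting rule of pivoted Cholesky. A minimal sketch, assuming a precomputed benchmark covariance matrix (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def greedy_entropy_select(cov: np.ndarray, k: int) -> list[int]:
    """Greedy log-det (Gaussian entropy) maximization over a covariance.

    Equivalent to pivoted Cholesky: each step picks the index with the
    largest residual (conditional) variance, then downdates the rest of
    the matrix by the corresponding Schur complement.
    """
    residual = cov.astype(float).copy()
    selected: list[int] = []
    for _ in range(k):
        diag = np.diag(residual).copy()
        diag[selected] = -np.inf          # never re-pick a chosen index
        j = int(np.argmax(diag))
        selected.append(j)
        # Schur-complement downdate: condition the rest on benchmark j
        col = residual[:, j:j + 1]
        residual = residual - col @ col.T / residual[j, j]
    return selected

# Example: benchmark 0 has the largest variance, so it is picked first.
cov = np.array([[4.0, 0.1, 0.1],
                [0.1, 1.0, 0.9],
                [0.1, 0.9, 1.0]])
print(greedy_entropy_select(cov, 2))
```

Because each step only needs the residual diagonal and one rank-one update, the whole selection runs in O(k n^2) rather than re-evaluating log-determinants from scratch.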

Abstract

Evaluating large language models across many benchmarks is expensive, yet many benchmarks are highly correlated. We formalize the selection of a small, informative subset as submodular maximization under a multivariate Gaussian model. Entropy (log-determinant covariance) and mutual information between selected and remaining benchmarks arise as natural objectives. Both are submodular; entropy selection coincides with pivoted Cholesky and has spectral residual bounds, while mutual information is non-monotone in general but empirically monotone for small subsets, so we optimize it greedily. Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets.
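Under the Gaussian model, the mutual information between a selected subset S and the remaining benchmarks R reduces to log-determinants, I(X_S; X_R) = (1/2)(log det Σ_SS + log det Σ_RR − log det Σ), and imputation of unmeasured benchmarks is the conditional mean. A minimal sketch of both, assuming a zero-mean Gaussian and a given covariance (function names are illustrative, not the paper's code):

```python
import numpy as np

def mutual_information(cov: np.ndarray, S: list[int]) -> float:
    """I(X_S; X_R) = 0.5 * (logdet Sigma_SS + logdet Sigma_RR - logdet Sigma)."""
    n = cov.shape[0]
    R = [i for i in range(n) if i not in S]
    _, ld_all = np.linalg.slogdet(cov)
    _, ld_S = np.linalg.slogdet(cov[np.ix_(S, S)])
    _, ld_R = np.linalg.slogdet(cov[np.ix_(R, R)])
    return 0.5 * (ld_S + ld_R - ld_all)

def greedy_mi_select(cov: np.ndarray, k: int) -> list[int]:
    """Greedily grow S by the largest mutual-information gain (no guarantee
    without monotonicity, but effective when MI is empirically monotone)."""
    n = cov.shape[0]
    S: list[int] = []
    for _ in range(k):
        best = max((j for j in range(n) if j not in S),
                   key=lambda j: mutual_information(cov, S + [j]))
        S.append(best)
    return S

def impute_missing(cov: np.ndarray, S: list[int], x_S: np.ndarray) -> np.ndarray:
    """Conditional mean E[x_R | x_S] for a zero-mean Gaussian."""
    n = cov.shape[0]
    R = [i for i in range(n) if i not in S]
    return cov[np.ix_(R, S)] @ np.linalg.solve(cov[np.ix_(S, S)], x_S)

# Example: benchmark 2 is independent of the rest, so it carries zero MI
# about them, while the correlated pair (0, 1) is informative.
cov = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(greedy_mi_select(cov, 1))               # picks a correlated benchmark
print(impute_missing(cov, [0], np.array([2.0])))
```

This makes the paper's comparison concrete: entropy-style selection favors high-variance benchmarks, while the MI objective favors benchmarks that predict the unselected ones, which is what matters for imputation.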