Identifying Information from Observations with Uncertainty and Novelty

arXiv stat.ML / 4/17/2026


Key Points

  • The paper studies how a learning system can identify the underlying data-generating process from observations that include uncertainty and novelty, while choosing the hypothesis that best matches the observed data.
  • It formalizes “identifying information” as the bits that verify or falsify a hypothesis about the data-generating process, and provides an information-theoretic characterization of the computation of hypothesis identification.
  • The authors define hypothesis identification and sample complexity using an indicator-function computation over a hypothesis set, connecting algorithmic and probabilistic notions of information.
  • They derive sample-complexity results across different data-generating regimes, from deterministic processes to ergodic stationary stochastic processes, linking finite-step identification with asymptotic statistics and PAC-learning.
  • The work also shows that, for computable PAC-Bayes learners on a fixed finite hypothesis set, the sample-complexity distribution is determined by its moments under the prior, so it can be approximated to any precision that available resources allow.
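To make the indicator-function idea from the key points concrete, here is a minimal sketch of hypothesis identification over a finite hypothesis set. All names (`consistent`, `identify`, the toy hypotheses) are illustrative assumptions, not the paper's notation: a hypothesis is kept while the indicator of consistency with the observations is 1, identification occurs when exactly one candidate survives, and the number of observations consumed is the sample complexity.

```python
# Hypothetical sketch of hypothesis identification via an indicator
# function over a finite hypothesis set (names are illustrative).

def consistent(hypothesis, observations):
    """Indicator: True iff the hypothesis is not falsified by any observation."""
    return all(hypothesis(x) == y for x, y in observations)

def identify(hypotheses, stream):
    """Consume observations until exactly one hypothesis survives.

    Returns (hypothesis, sample_complexity); returns (None, n) if every
    candidate is falsified, i.e. the stream carried novel information
    with respect to a misspecified hypothesis set.
    """
    observations = []
    for n, obs in enumerate(stream, start=1):
        observations.append(obs)
        survivors = [h for h in hypotheses if consistent(h, observations)]
        if len(survivors) == 1:
            return survivors[0], n          # identified in n steps
        if not survivors:
            return None, n                  # novel information detected
    return None, len(observations)          # still ambiguous

# Toy deterministic process: data generated by x -> 2*x.
hypotheses = [lambda x: x, lambda x: 2 * x, lambda x: x * x]
stream = [(2, 4), (1, 2)]  # (2,4) rules out identity; (1,2) rules out square
winner, n_samples = identify(hypotheses, stream)
```

In this toy run the first observation leaves two consistent candidates (doubling and squaring agree at x=2), so a second observation is needed: the sample complexity here is 2.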

Abstract

A machine that learns a task from observations must encounter and process uncertainty and novelty, especially when it must maintain performance as it observes new information and select the hypothesis that best fits the current observations. In this context, several key questions arise: what and how much information did the observations provide, how much information is required to identify the data-generating process, how many observations remain before that information is obtained, and how does a predictor determine that it has observed novel information? We formalize identifying information to answer these questions and to synthesize prior work. Identifying information consists of the bits that verify or falsify a hypothesis as the data-generating process. Within this formalization, we prove information-theoretic characteristics of the computation of hypothesis identification and of the resulting sample complexity. We define hypothesis identification and sample complexity via the computation of an indicator function over a set of hypotheses, bridging algorithmic and probabilistic information. We detail the sample complexity and its properties for data-generating processes ranging from deterministic processes to ergodic stationary stochastic processes, connecting the notion of identifying information obtained in finitely many steps with asymptotic statistics and PAC-learning. The computation of the indicator function naturally formalizes novel information and its identification from observations with respect to a hypothesis set, which in turn detects a misspecified hypothesis set. We also prove that the sample-complexity distribution of a computable PAC-Bayes learner is determined by its moments with respect to the prior probability distribution over a fixed finite hypothesis set, and thus an approximation of the sample-complexity distribution is always computable to the precision that available resources allow.
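The PAC-Bayes claim at the end of the abstract can be illustrated with a small sketch. Under the assumption that each hypothesis h in a fixed finite set has a sample complexity n(h), the sample-complexity distribution is the pushforward of the prior p(h) through n(h), and on a finite set that distribution is pinned down by finitely many moments. The helper names and the toy prior below are illustrative, not from the paper.

```python
# Hedged sketch: sample-complexity distribution of a learner over a
# fixed finite hypothesis set, as the pushforward of the prior, and
# its moments under that prior. Numbers are a toy example.
from collections import defaultdict

def sc_distribution(prior, sample_complexity):
    """Pushforward of the prior p(h) through n(h): P(N = n)."""
    dist = defaultdict(float)
    for h, p in prior.items():
        dist[sample_complexity[h]] += p
    return dict(dist)

def moment(prior, sample_complexity, k):
    """k-th moment E[N^k] of the sample complexity under the prior."""
    return sum(p * sample_complexity[h] ** k for h, p in prior.items())

prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}   # prior over three hypotheses
n = {"h1": 2, "h2": 4, "h3": 4}             # assumed sample complexities

dist = sc_distribution(prior, n)   # {2: 0.5, 4: 0.5}
mean = moment(prior, n, 1)         # expected sample complexity, 3.0 up to rounding
```

Since the support of N is finite, the first |support| moments suffice to reconstruct `dist`, which is one way to read the paper's claim that the distribution can be approximated to whatever precision resources allow.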