[R] From Garbage to Gold: A Formal Proof that GIGO Fails for High-Dimensional Data with Latent Structure — with a Connection to Benign Overfitting Prerequisites

Reddit r/MachineLearning / 3/18/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The authors formally prove that for data generated by a latent hierarchical structure Y ← S¹ → S² → S'², a Breadth strategy of expanding the predictor set dominates a Depth strategy of cleaning a fixed predictor set, due to two noise components—Predictor Error and Structural Uncertainty—that follow different information-theoretic limits.
  • The work connects to Benign Overfitting by showing that the latent structure naturally yields a low-rank-plus-diagonal covariance in S'², aligning with the spiked covariance prerequisites used to explain generalization of interpolating classifiers.
  • Empirical grounding is provided by citing a Cleveland Clinic Abu Dhabi study achieving 0.909 AUC for predicting stroke/MI in 558k patients with thousands of uncurated EHR variables, a result not explained by existing theory.
  • The paper highlights heuristics for assessing latent hierarchical structure and notes that traditional data-cleaning approaches can remain powerful in certain conditions, all within a lengthy 120-page treatment with extensive appendices.

Paper: https://arxiv.org/abs/2603.12288

GitHub (R simulation, Paper Summary, Audio Overview): https://github.com/tjleestjohn/from-garbage-to-gold

I'm Terry, the first author. This paper has been 2.5 years in the making and I'd genuinely welcome technical critique from this community.

The core result: We formally prove that for data generated by a latent hierarchical structure — Y ← S¹ → S² → S'² — a Breadth strategy of expanding the predictor set asymptotically dominates a Depth strategy of cleaning a fixed predictor set. The proof follows from partitioning predictor-space noise into two formally distinct components:

  • Predictor Error: Observational discrepancy between true and measured predictor values. Addressable by cleaning, repeated measurement, or expanding the predictor set with distinct proxies of S¹.
  • Structural Uncertainty: The irreducible ambiguity arising from the probabilistic S¹ → S² generative mapping — the information deficit that persists even with perfect measurement of a fixed predictor set. Only resolvable by expanding the predictor set with distinct proxies of S¹.

The distinction matters because these two noise types obey different information-theoretic limits. Cleaning strategies are provably bounded by Structural Uncertainty regardless of measurement precision. Breadth strategies are not.
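To make the two limits concrete, here is a minimal Python sketch (an illustrative toy, not the repo's annotated R simulation; the proxy count and noise variances are my own assumptions). A single perfectly measured proxy stays capped by Structural Uncertainty, while averaging many badly measured proxies of S¹ pushes past that cap:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000   # observations
k = 50       # number of distinct proxies in the Breadth arm

# Latent cause S1 drives the outcome Y.
s1 = rng.normal(size=n)
y = s1 + 0.5 * rng.normal(size=n)

def noisy_proxy(measurement_sd):
    """One proxy of S1: structural noise (the probabilistic S1 -> S2 step,
    variance 1) plus Predictor Error of the given magnitude."""
    s2 = s1 + rng.normal(size=n)                      # Structural Uncertainty
    return s2 + measurement_sd * rng.normal(size=n)   # Predictor Error

# Depth: one proxy, cleaned to perfection (zero measurement error).
depth = noisy_proxy(0.0)

# Breadth: k proxies, each measured badly, naively averaged.
breadth = np.mean([noisy_proxy(1.0) for _ in range(k)], axis=0)

def r2(x):
    return np.corrcoef(x, y)[0, 1] ** 2

print(f"Depth   (1 clean proxy):    R^2 = {r2(depth):.3f}")
print(f"Breadth ({k} dirty proxies): R^2 = {r2(breadth):.3f}")
```

With these toy numbers, the Depth arm plateaus well below the Breadth arm no matter how small its measurement error is made, because the structural S¹ → S² noise is baked into the single proxy; the Breadth arm averages that structural noise away across distinct proxies.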

The BO connection: We formally show that the primary structure Y ← S¹ → S² → S'² naturally produces low-rank-plus-diagonal covariance structure in S'² — precisely the spiked covariance prerequisite that the Benign Overfitting literature (Bartlett et al., Hastie et al., Tsigler & Bartlett) identifies as enabling interpolating classifiers to generalize. This provides a generative data-architectural explanation for why the BO conditions hold empirically rather than being imposed as abstract mathematical prerequisites.
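The covariance claim is easy to check numerically. In this Python sketch (dimensions and Gaussian loadings W are my illustrative assumptions, not the paper's construction), observed predictors generated as noisy linear readouts of a low-dimensional latent layer produce exactly the spiked, low-rank-plus-diagonal spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 5_000, 200, 3   # samples, observed predictors, latent factors

# S'2: each observed predictor is a noisy linear readout of a
# k-dimensional latent layer (loadings W are arbitrary here).
latent = rng.normal(size=(n, k))
W = rng.normal(size=(k, p))
X = latent @ W + rng.normal(size=(n, p))  # low-rank signal + diagonal noise

# Population covariance is W^T W (rank k) + I: k large spikes over a bulk.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print("top 5 eigenvalues:", np.round(eigvals[:5], 1))
print("bulk median:      ", round(float(np.median(eigvals)), 2))
```

The top k eigenvalues sit orders of magnitude above the noise bulk, which is the spiked-covariance regime the BO papers assume.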

Empirical grounding: The theory was motivated by a peer-reviewed clinical result at Cleveland Clinic Abu Dhabi — 0.909 AUC predicting stroke/MI in 558k patients using thousands of uncurated EHR variables with no manual cleaning, published in PLOS Digital Health — that could not be explained by existing theory.

Honest scope: The framework requires data with a latent hierarchical structure, and the paper provides heuristics for assessing whether this condition holds. We are explicit that traditional data-centric AI's (DCAI's) focus on cleaning the outcome variable remains distinctly powerful in specific conditions — particularly where Common Method Variance is present.

The paper is long — 120 pages with 8 appendices — because GIGO is deeply entrenched and the theory is nuanced. The core proofs are in Sections 3-4. The BO connection is Section 7. Limitations are Section 15 and are extensive.

A fully annotated R simulation in the repo demonstrates Dirty Breadth vs Clean Parsimony across varying noise conditions.

Happy to engage with technical questions or pushback on the proofs.

submitted by /u/Chocolate_Milk_Son