Which Leakage Types Matter?

arXiv cs.LG / 4/7/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper reports 28 within-subject counterfactual experiments across 2,047 tabular datasets (and a boundary experiment across 129 temporal datasets) to quantify how four ML data leakage types affect measured performance.
  • Normalization/estimation leakage (e.g., fitting scalers on the full dataset) is found to be negligible, producing at most |ΔAUC| ≤ 0.005 across tested conditions.
  • Selection leakage (e.g., peeking during preprocessing or seed cherry-picking) is substantial, with roughly 90% of the observed performance gain attributed to noise exploitation that inflates reported scores.
  • Memorization leakage grows with model capacity, increasing from about d_z = 0.37 for Naive Bayes to about 1.11 for Decision Trees.
  • Boundary leakage is invisible under random cross-validation, and the authors argue that common textbook emphasis should be inverted: selection leakage matters most at practical dataset sizes, while normalization leakage matters least.
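To make the first two leakage classes concrete, here is a minimal sketch (not the paper's code) of a within-subject counterfactual for Class I estimation leakage: the same model is evaluated with a scaler fit on the full dataset (leaky) versus on the training split only (clean), and the resulting ΔAUC is compared. The dataset, model, and split sizes are illustrative assumptions; only scikit-learn primitives are used.

```python
# Sketch of a Class I (estimation) leakage counterfactual:
# fit the scaler on the full dataset vs. on the training fold only,
# then compare test AUC of an otherwise identical pipeline.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def auc_with_scaler(fit_on_full: bool) -> float:
    scaler = StandardScaler()
    if fit_on_full:
        scaler.fit(X)       # leaky: test rows inform the scaling statistics
    else:
        scaler.fit(X_tr)    # clean: statistics from training data only
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    return roc_auc_score(y_te, clf.predict_proba(scaler.transform(X_te))[:, 1])

delta_auc = auc_with_scaler(True) - auc_with_scaler(False)
print(f"ΔAUC from full-data scaling: {delta_auc:+.4f}")
```

On a single synthetic dataset like this, the gap is typically tiny, which is consistent with the paper's finding that estimation leakage is negligible; the paper's contribution is measuring this systematically across 2,047 datasets.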

Abstract

Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation - fitting scalers on full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0.005. Class II (selection - peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
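The Class II mechanism the abstract calls "seed cherry-picking" can be sketched in a few lines (hypothetical setup, not the paper's protocol): evaluating the same model over many train/test split seeds and reporting the best seed, rather than the average, exploits split noise and inflates the score. The dataset and seed count are illustrative assumptions.

```python
# Sketch of Class II (selection) leakage via seed cherry-picking:
# the mean AUC over seeds is an honest estimate; the max over seeds
# exploits split noise and overstates performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

# Small, noisy dataset: split-to-split variance is large here.
X, y = make_classification(n_samples=200, n_features=10, flip_y=0.2,
                           random_state=0)

aucs = []
for seed in range(50):  # try many split seeds
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    clf = GaussianNB().fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

honest = float(np.mean(aucs))  # expectation over seeds
cherry = float(np.max(aucs))   # "best seed" reporting
print(f"mean AUC {honest:.3f} vs cherry-picked AUC {cherry:.3f}")
```

The gap between the two numbers is pure selection effect: no model or data changed, only which seed gets reported.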