bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R

arXiv stat.ML / 4/14/2026

Key Points

  • The paper introduces bioLeak, an R package designed to reduce optimistic bias from data leakage in biomedical machine learning by using leakage-aware resampling workflows and diagnostics.
  • It addresses limitations of standard row-wise cross-validation and global preprocessing when data has repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies.
  • bioLeak provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting.
  • The package supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries.
  • Simulations under controlled leakage mechanisms and a multi-study transcriptomics case study show that guarded and leaky pipelines can lead to materially different performance conclusions, which is the case for auditing and reproducible pipeline design.
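
To make the contrast between row-wise and leakage-aware splitting concrete, here is a minimal sketch in plain Python. It is not bioLeak's API (the package is written in R and its function names are not given in this summary); `group_kfold`, `shared_groups`, and the toy `subjects` vector are hypothetical names chosen for illustration. The sketch assumes that each row carries a group ID, such as a subject or study, and that leakage occurs when the same ID appears on both sides of a split.

```python
from collections import defaultdict


def group_kfold(groups, k):
    """Assign rows to k folds so that all rows sharing a group ID share a fold."""
    sizes = defaultdict(int)
    for g in groups:
        sizes[g] += 1
    fold_sizes = [0] * k
    fold_of_group = {}
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold (ties broken by group name, then fold index).
    for g, n in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        f = min(range(k), key=lambda i: fold_sizes[i])
        fold_of_group[g] = f
        fold_sizes[f] += n
    return [fold_of_group[g] for g in groups]


def shared_groups(groups, folds, test_fold):
    """Return the group IDs that occur in both the training and the test rows."""
    train = {g for g, f in zip(groups, folds) if f != test_fold}
    test = {g for g, f in zip(groups, folds) if f == test_fold}
    return train & test


# Toy data: repeated measurements on five hypothetical subjects.
subjects = ["s1", "s1", "s1", "s2", "s2", "s3", "s3", "s4", "s5", "s5"]

rowwise = [i % 3 for i in range(len(subjects))]  # naive row-wise 3-fold assignment
grouped = group_kfold(subjects, 3)               # subject-aware 3-fold assignment

for fold in range(3):
    print(fold,
          sorted(shared_groups(subjects, rowwise, fold)),
          sorted(shared_groups(subjects, grouped, fold)))
```

With the row-wise assignment, several subjects contribute rows to both training and test data in the same fold; with the grouped assignment the overlap is empty by construction. The same idea extends to study-level or batch-level IDs.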

Abstract

Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how apparent performance changes under controlled leakage mechanisms, and the case study illustrates how guarded and leaky pipelines can yield materially different conclusions on multi-study transcriptomic data. The emphasis throughout is on software design, reproducible workflows, and interpretation of diagnostic output.
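
The difference between globally estimated and train-fold-only preprocessing can be illustrated with a small sketch, again in plain Python rather than bioLeak's own R interface. The helpers `fit_scaler` and `apply_scaler` and the toy values are assumptions made for this example. Standardization is used here only as a stand-in for any preprocessing step whose parameters are estimated from data.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Estimate standardization parameters (mean, population SD) from the given rows."""
    mu = mean(values)
    sd = pstdev(values) or 1.0  # guard against a zero SD on constant input
    return mu, sd


def apply_scaler(values, params):
    """Standardize values with previously estimated parameters."""
    mu, sd = params
    return [(v - mu) / sd for v in values]


# Toy fold: the held-out rows come from a shifted distribution.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Guarded: parameters are estimated on the training fold only.
guarded_params = fit_scaler(train)
# Leaky: parameters are estimated on all rows, so the held-out rows
# influence the transformation applied to the training data.
leaky_params = fit_scaler(train + test)

guarded_test = apply_scaler(test, guarded_params)
leaky_test = apply_scaler(test, leaky_params)

print("guarded mean/sd:", guarded_params)
print("leaky mean/sd:  ", leaky_params)
```

In the guarded variant the held-out rows never influence the scaler, so the shift in the test data remains visible after transformation. In the leaky variant the shift is partly absorbed into the estimated mean and standard deviation, which is one way in which apparent performance can drift from what would be seen on new data.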