bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R
arXiv stat.ML / 4/14/2026
Key Points
- The paper introduces bioLeak, an R package designed to reduce optimistic bias from data leakage in biomedical machine learning by using leakage-aware resampling workflows and diagnostics.
- It addresses the limitations of standard row-wise cross-validation and global preprocessing, both of which can let information from evaluation data reach training when the data contain repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies.
- bioLeak supports train-fold-only preprocessing, nested hyperparameter tuning, cross-validated model fitting, and post hoc leakage audits with HTML reporting for interpretability.
- The package handles multiple ML task types (binary and multiclass classification, regression, and survival analysis) and uses structured S4 containers for splits, fits, audits, and inflation summaries.
- Simulation and a transcriptomics case study show that guarded and leaky pipelines can lead to materially different performance conclusions, highlighting the value of auditing and reproducible pipeline design.
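
To make the two core ideas in the points above concrete (splitting by group rather than by row, and estimating preprocessing parameters from the training fold only), here is a minimal sketch in plain Python. It does not use bioLeak or reproduce its API: the function names `grouped_folds`, `fit_scaler`, and `apply_scaler`, along with the toy data, are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean, pstdev


def grouped_folds(groups, k, seed=0):
    """Assign whole groups (e.g. subjects) to k folds.

    Every row of a group lands in the same fold, so repeated
    measurements never straddle the train/test boundary.
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    folds = []
    for f in range(k):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == f]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != f]
        folds.append((train_idx, test_idx))
    return folds


def fit_scaler(values):
    """Estimate centering and scaling parameters from the given values."""
    mu = mean(values)
    sd = pstdev(values) or 1.0  # fall back to 1.0 for constant input
    return mu, sd


def apply_scaler(values, params):
    """Standardize values using previously estimated parameters."""
    mu, sd = params
    return [(v - mu) / sd for v in values]


# Toy data (hypothetical): 6 subjects with 2 repeated measurements each.
groups = ["s1", "s1", "s2", "s2", "s3", "s3",
          "s4", "s4", "s5", "s5", "s6", "s6"]
x = [1.0, 1.1, 2.0, 2.2, 3.0, 2.9, 4.0, 4.1, 5.0, 5.2, 6.0, 5.9]

for train_idx, test_idx in grouped_folds(groups, k=3):
    # Guarded: scaling parameters come from the training fold only ...
    params = fit_scaler([x[i] for i in train_idx])
    x_train = apply_scaler([x[i] for i in train_idx], params)
    # ... and are then applied unchanged to the held-out fold.
    x_test = apply_scaler([x[i] for i in test_idx], params)
    # A model would be fit on x_train and scored on x_test here.
```

A leaky variant would call `fit_scaler(x)` once on all rows before splitting, or split by row index so that the two measurements of one subject can fall on opposite sides of the split. Either shortcut can make held-out performance look better than it would on genuinely new subjects.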

