Smart Ensemble Learning Framework for Predicting Groundwater Heavy Metal Pollution

arXiv cs.AI / 5/4/2026

Key Points

The study addresses biased heavy metal pollution predictions in groundwater by modeling the Heavy Metal Pollution Index (HPI), which is skewed and influenced by correlated contaminants across space.
It introduces a predictive framework that combines response transformations (raw, log, and Gaussian copula) with nested cross-validated ensemble machine learning.
Models trained on raw HPI showed misleadingly near-perfect fit (Elastic Net and stacked ensemble R^2 ≈ 1.0), while log transformation improved stability and Gaussian copula produced the most reliable accuracy.
Using a copula-based stacked ensemble, the framework achieved R^2 = 0.96 with improved residual behavior and spatially plausible contamination maps.
Clustering diagnostics (DBSCAN) identified Fe and Mn as key contributors to HPI, and the authors note limitations of non-spatial validation and basin-specific applicability, recommending spatial validation in future work.
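
The response transformations named in the points above can be illustrated with a minimal sketch. It rests on an assumption: this summary does not say how the authors implement the Gaussian copula transformation, so the code uses one common reading of it for a single response, the rank-based normal-score transform (ranks mapped to uniform scores, then through the standard normal quantile function). The function names and the HPI values are hypothetical, and the log transform assumes strictly positive HPI.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def log_transform(hpi):
    """Natural log of each value (assumes every value is > 0)."""
    return [math.log(v) for v in hpi]


def gaussian_copula_transform(hpi):
    """Normal scores: rank -> uniform score in (0, 1) -> standard normal quantile."""
    n = len(hpi)
    normal = NormalDist()
    return [normal.inv_cdf(r / (n + 1)) for r in average_ranks(hpi)]


# Hypothetical right-skewed HPI values, for illustration only.
hpi = [12.0, 15.5, 18.2, 22.0, 30.1, 45.0, 310.0]
hpi_log = log_transform(hpi)
hpi_copula = gaussian_copula_transform(hpi)
```

The normal-score version depends only on ranks, so a single extreme HPI value cannot dominate the fitted scale the way it can on the raw response. This is one plausible reason a raw-scale fit can look better than it is.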

Abstract

Groundwater in the Densu Basin is increasingly threatened by heavy metal contamination, but conventional methods fail to capture the statistical complexity and spatial heterogeneity of pollution indicators. A key challenge is modelling the Heavy Metal Pollution Index (HPI), which is typically skewed and affected by correlated contaminants, leading to biased predictions without transformation. This study develops a predictive framework integrating response transformations with nested cross-validated ensemble machine learning. Three transformations (raw, log, and Gaussian copula) were applied to HPI and evaluated across six learners: support vector regression (SVM), k-nearest neighbours (k-NN), CART, Elastic Net, kernel ridge regression, and a stacked Lasso ensemble. Raw-scale models produced deceptively high fits (Elastic Net and stacked ensemble R^2 ≈ 1.0), suggesting over-optimism. The log transformation stabilised variance (SVM: R^2 = 0.93, RMSE = 0.18; k-NN: R^2 = 0.92, RMSE = 0.20). The Gaussian copula gave the most reliable results: stacked ensemble R^2 = 0.96 (RMSE = 0.19), with other learners maintaining high accuracy. Copula-based models improved residuals and produced spatially plausible maps. DBSCAN clustering revealed Fe and Mn as primary HPI contributors, consistent with regional hydrogeochemistry. Limitations include reliance on random (not spatial) cross-validation and basin-specific scope. Future work should explore spatial validation and other geological settings. Overall, distribution-aware ensembles with clustering diagnostics offer robust, interpretable assessments of groundwater contamination.
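
As a rough illustration of the nested cross-validated stacked ensemble described in the abstract, the sketch below uses scikit-learn. It is not the authors' code. The data are synthetic, the hyperparameter grid, fold counts, and learner settings are placeholders, and QuantileTransformer with a normal output distribution stands in for the Gaussian copula transformation as an assumption. Because TransformedTargetRegressor inverts the transform at prediction time, the R^2 values here are on the original response scale and are not comparable to the figures quoted in the abstract.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, LassoCV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 5 predictors and a right-skewed, positive response.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = np.exp(0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120))

# Base learners mirroring the families listed in the abstract (settings assumed).
base_learners = [
    ("svr", make_pipeline(StandardScaler(), SVR())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsRegressor())),
    ("cart", DecisionTreeRegressor(random_state=0)),
    ("enet", make_pipeline(StandardScaler(), ElasticNet(alpha=0.01))),
    ("krr", make_pipeline(StandardScaler(), KernelRidge(kernel="rbf"))),
]

# Lasso meta-learner fitted on out-of-fold predictions of the base learners.
stack = StackingRegressor(
    estimators=base_learners, final_estimator=LassoCV(cv=3), cv=3
)

# The target transform is re-fitted inside every training fold, so no
# information from held-out samples leaks into the transformation.
model = TransformedTargetRegressor(
    regressor=stack,
    transformer=QuantileTransformer(n_quantiles=30, output_distribution="normal"),
)

# Inner loop: hyperparameter tuning (tiny illustrative grid).
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    model,
    param_grid={"regressor__svr__svr__C": [1.0, 10.0]},
    cv=inner_cv,
    scoring="r2",
)

# Outer loop: performance estimate on folds never seen during tuning.
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print("outer-fold R^2:", np.round(scores, 3))
```

Both loops here use random KFold splits, which matches the limitation the abstract acknowledges. A spatially aware variant would replace the outer splitter with one that holds out whole spatial groups, for example GroupKFold with location-based group labels.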
