A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Histopathology

arXiv cs.CV / 3/18/2026

Key Points

  • The study systematically evaluates 11 publicly available Histopathology Foundation Models (HFMs) across 11 kidney-specific downstream tasks, covering multiple stains, spatial scales, and clinical objectives.
  • It uses repeated stratified group cross-validation for tile-level tasks and repeated nested stratified cross-validation for slide-level tasks, and assesses statistical significance with a Friedman test followed by pairwise Wilcoxon signed-rank tests with Holm-Bonferroni correction.
  • Results show moderate to strong performance on coarse meso-scale tasks such as diagnostic classification and detection of prominent structural alterations, but performance declines for fine-grained microstructural discrimination and prognosis-related signals, largely independent of stain type.
  • The authors release kidney-hfm-eval, an open-source Python package, to reproduce the evaluation pipelines, and conclude that kidney-specific, multi-stain, and multimodal HFMs are needed for clinically reliable nephrology decision-making.
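
To make the tile-level scheme in the second point concrete, here is a minimal pure-Python sketch of repeated stratified group cross-validation. It is not the kidney-hfm-eval implementation: the function name, the greedy balancing rule, and the patient-style group IDs are all assumptions made for illustration. The idea is that all tiles from one group (for example, one patient or slide) stay on the same side of every split, while class counts are kept roughly balanced across folds.

```python
import random
from collections import Counter, defaultdict


def repeated_stratified_group_folds(labels, groups, n_splits=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), keeping whole groups together.

    Simplified greedy scheme (an assumption, not the paper's code): each group
    goes to the fold that currently holds the fewest samples of that group's
    majority class, with ties broken by total fold size.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    for repeat in range(n_repeats):
        rng = random.Random(seed + repeat)  # a different shuffle per repeat
        group_ids = list(by_group)
        rng.shuffle(group_ids)
        # Place large groups first; the sort is stable, so the shuffle breaks ties.
        group_ids.sort(key=lambda g: -len(by_group[g]))

        fold_members = [[] for _ in range(n_splits)]
        fold_class_counts = [Counter() for _ in range(n_splits)]
        for g in group_ids:
            idxs = by_group[g]
            class_counts = Counter(labels[i] for i in idxs)
            majority = class_counts.most_common(1)[0][0]
            target = min(
                range(n_splits),
                key=lambda f: (fold_class_counts[f][majority], len(fold_members[f])),
            )
            fold_members[target].extend(idxs)
            fold_class_counts[target].update(class_counts)

        for fold in range(n_splits):
            test_idx = sorted(fold_members[fold])
            train_idx = sorted(
                i for f in range(n_splits) if f != fold for i in fold_members[f]
            )
            yield repeat, fold, train_idx, test_idx


# Hypothetical toy data: 12 tiles, 6 patients, 2 classes.
labels = [0] * 6 + [1] * 6
groups = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4", "p5", "p5", "p6", "p6"]
for repeat, fold, train_idx, test_idx in repeated_stratified_group_folds(
    labels, groups, n_splits=3, n_repeats=2
):
    print(repeat, fold, test_idx)
```

In practice a library routine such as scikit-learn's StratifiedGroupKFold, re-run with several random seeds, would likely replace the greedy loop; the sketch only shows the constraint that matters, namely that no group appears in both the training and test indices of a fold.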

Abstract

Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite the coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile- and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using a Friedman test followed by pairwise Wilcoxon signed-rank tests with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at https://pypi.org/project/kidney-hfm-eval/, that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
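
The multiple-comparison step in the statistical procedure above can be illustrated with a short sketch of the Holm-Bonferroni adjustment. This is a generic textbook version, not code from kidney-hfm-eval: the function name, the model-pair labels, and the p-values are hypothetical. In a real analysis, the raw p-values would come from pairwise Wilcoxon signed-rank tests (for example, scipy.stats.wilcoxon), run only after the omnibus Friedman test rejects the hypothesis that all models perform equally.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the same order as the input.

    The smallest raw p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise model comparisons.
raw = {
    ("model_a", "model_b"): 0.01,
    ("model_a", "model_c"): 0.04,
    ("model_b", "model_c"): 0.03,
}
adjusted = holm_bonferroni(list(raw.values()))
for pair, p_adj in zip(raw, adjusted):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(pair, round(p_adj, 4), verdict)
```

With these made-up inputs, only the first comparison stays below 0.05 after adjustment, which shows how the correction can withdraw significance from comparisons that look significant in isolation.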