Validation of Whole-Slide Foundation Models for Image Retrieval in TCGA Data

arXiv cs.CV / 5/5/2026


Key Points

  • The study benchmarks 10 whole-slide image retrieval pipelines on 9,387 TCGA diagnostic slides across 17 organs and 60 diagnoses using patient-level leave-one-patient-out evaluation.
  • Results show performance differences depend more on organ and diagnosis than on model architecture, with TITAN performing best overall but only with a modest advantage.
  • ABMIL (supervised multiple-instance aggregation on patch embeddings) and patch-based retrieval methods achieve broadly comparable Top-1/Top-3 accuracy, and no single architecture is consistently dominant.
  • Retrieval is largely driven by patch-level feature representations, with limited gains from slide-level aggregation, implying aggregation may often be unnecessary.
  • Morphology-only retrieval hits an intrinsic ceiling: rare, heterogeneous, or closely related subtypes remain difficult, some subtypes score 0% accuracy across all methods, and the best model reaches only about 68% ± 21% accuracy on TCGA, underscoring the challenges that remain before clinical deployment.
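The patient-level leave-one-patient-out protocol behind these numbers can be sketched in a few lines: for each query slide, all slides from the same patient are removed from the gallery, the nearest slide embeddings are retrieved by cosine similarity, and a hit is counted when the query's diagnosis appears among the top k results. The code below is a minimal illustration of that metric, not the paper's implementation; the toy embeddings and labels are synthetic.

```python
import numpy as np

def loo_topk_accuracy(embeddings, labels, patient_ids, k=1):
    """Patient-level leave-one-patient-out Top-k retrieval accuracy.

    For each query slide, slides from the same patient are excluded from
    the gallery; a hit is scored when the query's diagnosis label appears
    among the k nearest neighbours by cosine similarity.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T                                    # pairwise cosine similarity
    hits = 0
    for i in range(len(labels)):
        cand = np.where(patient_ids != patient_ids[i])[0]   # drop same-patient slides
        top = cand[np.argsort(sims[i, cand])[::-1][:k]]     # k nearest gallery slides
        hits += labels[i] in labels[top]
    return hits / len(labels)

# Toy 2-D slide embeddings (synthetic, for illustration only)
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
lab = np.array(["LUAD", "LUAD", "BRCA", "BRCA"])
pid = np.array(["p1", "p2", "p3", "p4"])
print(loo_topk_accuracy(emb, lab, pid, k=1))  # → 1.0
```

With one slide per patient, leave-one-patient-out reduces to ordinary leave-one-out; the patient mask matters when a patient contributes several slides, preventing trivially easy self-matches.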

Abstract

Foundation models are reshaping computational histopathology, yet their value for whole-slide image retrieval relative to strong patch-based and supervised aggregation baselines remains unclear. We benchmarked ten pipelines on 9,387 diagnostic slides spanning 17 organs and 60 diagnoses from The Cancer Genome Atlas (TCGA) using patient-level leave-one-patient-out evaluation. Methods included four pre-trained slide foundation models, a supervised attention-based multiple instance learning (ABMIL) aggregator on patch embeddings, and patch-level retrieval across five sampling densities. Performance varied more across organs and diagnoses than across architectures. Although the slide foundation model TITAN achieved the strongest overall results, its advantage was modest; ABMIL and patch-based methods reached comparable Top-1 and Top-3 accuracy, with no model consistently dominant. Morphologically distinctive entities approached ceiling performance, while rare, heterogeneous, and closely related subtypes remained challenging. Misclassifications aligned with organs exhibiting known inter-observer variability, suggesting an intrinsic ceiling for morphology-only retrieval. Performance was driven primarily by patch-level feature representations, with limited benefit from slide-level aggregation, indicating aggregation may be unnecessary in many settings. These findings argue against a universally optimal architecture and instead support organ-resolved benchmarking, diagnosis-aware or ensemble strategies, stronger feature representations, and multimodal retrieval frameworks. Notably, even the best model achieved only approximately 68% ± 21% retrieval accuracy on TCGA, and some subtypes showed 0% accuracy across all methods, highlighting fundamental limitations of morphology-based representations and the need for substantial progress before reliable clinical deployment.
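The ABMIL baseline referred to in the abstract pools patch embeddings into a single slide vector through a learned attention mechanism (in the style of Ilse et al., 2018). The forward pass of that pooling step can be sketched as follows; the weight matrices here are random placeholders (in real ABMIL they are trained end-to-end with a classification head), and the dimensions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def abmil_pool(patches, W1, w2):
    """Attention-based MIL pooling, forward pass only.

    patches: (n, d) patch embeddings for one slide.
    W1: (d, h) and w2: (h,) parameterise the attention network
    (random here; learned end-to-end in actual ABMIL).
    Returns the attention-weighted slide embedding of shape (d,).
    """
    scores = np.tanh(patches @ W1) @ w2     # (n,) unnormalised attention scores
    a = np.exp(scores - scores.max())
    a /= a.sum()                            # softmax over the slide's patches
    return a @ patches                      # weighted average = slide embedding

rng = np.random.default_rng(0)
patches = rng.normal(size=(500, 768))       # e.g. 500 patches, 768-dim features
slide = abmil_pool(patches,
                   rng.normal(size=(768, 128)),
                   rng.normal(size=128))
print(slide.shape)  # (768,)
```

The slide vector produced this way can be plugged directly into the same nearest-neighbour retrieval loop as any foundation-model embedding, which is what makes the head-to-head comparison in the study possible.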