Good Rankings, Wrong Probabilities: A Calibration Audit of Multimodal Cancer Survival Models

arXiv cs.LG / 4/7/2026

Key Points

  • The paper argues that high discriminative performance (e.g., concordance index) of multimodal cancer survival models does not guarantee that predicted survival probabilities are statistically calibrated.
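  • It presents what the authors describe as, to their knowledge, the first systematic fold-level 1-calibration audit of multimodal whole-slide imaging (WSI) plus genomics survival architectures, covering native discrete-time outputs (3 models on TCGA-BRCA) and Breslow-reconstructed survival curves (11 architectures across 5 TCGA cancer types).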
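  • In Experiment A, which uses native discrete-time outputs from three models on TCGA-BRCA, every model fails 1-calibration on a majority of folds: 12 of 15 fold-level tests reject after Benjamini-Hochberg correction. Across all 290 fold-level tests, 166 reject correct calibration at the median event time (FDR = 0.05).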
  • Gating-based fusion is associated with better calibration, whereas bilinear and concatenation fusion are not. Post-hoc Platt scaling reduces miscalibration at the evaluated horizon (for MCAT, from 5 of 5 folds failing to 2 of 5) without affecting discrimination.
  • The authors conclude that calibration audits are necessary for clinical readiness and that using concordance index alone can be misleading.
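
To make the idea of a single-horizon calibration test concrete, here is a minimal sketch of a Hosmer-Lemeshow-style check at one time point. It is an illustration under stated assumptions, not the paper's procedure: the function names `one_calibration_test` and `chi2_sf_even_dof` are hypothetical, every patient's event status at the horizon is assumed known (no censoring before the horizon, which real survival data rarely satisfies), and the p-value uses a closed form that only holds when the degrees of freedom (`n_bins - 2`) are even.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Chi-square survival function P(X > x), closed form for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form requires a positive even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    # sf(x) = exp(-x/2) * sum_{i=0}^{dof/2 - 1} (x/2)^i / i!
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def one_calibration_test(event_probs, events, n_bins=10):
    """Hosmer-Lemeshow-style test at a single horizon t*.

    event_probs[i]: predicted P(event by t*) = 1 - S_i(t*)
    events[i]:      1 if the event occurred by t*, else 0
                    (ASSUMPTION: no patient is censored before t*)
    Returns (statistic, p_value); a small p-value rejects calibration.
    """
    pairs = sorted(zip(event_probs, events))  # order patients by predicted risk
    n = len(pairs)
    stat = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(e for _, e in chunk)   # events seen in this risk bin
        expected = sum(p for p, _ in chunk)   # events the model predicted
        mean_p = expected / size
        denom = size * mean_p * (1.0 - mean_p)
        if denom > 0:                         # skip degenerate bins (mean 0 or 1)
            stat += (observed - expected) ** 2 / denom
    return stat, chi2_sf_even_dof(stat, n_bins - 2)
```

A model can rank patients perfectly and still fail this check: if every predicted probability is halved, the ordering (and so the concordance index) is unchanged, but observed events exceed expected events in every bin and the statistic grows.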

Abstract

Multimodal deep learning models that fuse whole-slide histopathology images with genomic data have achieved strong discriminative performance for cancer survival prediction, as measured by the concordance index. Yet whether the survival probabilities derived from these models - either directly from native outputs or via standard post-hoc reconstruction - are calibrated remains largely unexamined. We conduct, to our knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival architectures, evaluating native discrete-time survival outputs (Experiment A: 3 models on TCGA-BRCA) and Breslow-reconstructed survival curves from scalar risk scores (Experiment B: 11 architectures across 5 TCGA cancer types). In Experiment A, all three models fail 1-calibration on a majority of folds (12 of 15 fold-level tests reject after Benjamini-Hochberg correction). Across the full 290 fold-level tests, 166 reject the null of correct calibration at the median event time after Benjamini-Hochberg correction (FDR = 0.05). MCAT achieves C-index 0.817 on GBMLGG yet fails 1-calibration on all five folds. Gating-based fusion is associated with better calibration; bilinear and concatenation fusion are not. Post-hoc Platt scaling reduces miscalibration at the evaluated horizon (e.g., MCAT: 5/5 folds failing to 2/5) without affecting discrimination. The concordance index alone is insufficient for evaluating survival models intended for clinical use.
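
The abstract reports rejections after Benjamini-Hochberg correction at FDR = 0.05. As a reminder of how that step-up procedure works, here is a minimal sketch; the function name `benjamini_hochberg` and the p-values in the usage example are hypothetical and are not taken from the paper.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the null is rejected.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= k * fdr / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical fold-level p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, fdr=0.05)
```

Because the rule is step-up, a p-value that misses its own rank threshold is still rejected when a larger p-value passes a later one, which is why the correction is applied to the whole family of fold-level tests at once rather than test by test.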