MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality

arXiv cs.LG / 3/30/2026

Key Points

  • MUST (Modality-Specific representation-aware Transformer) is a framework that explicitly handles missing modalities in multimodal medical data by decomposing each modality's representation into its modality-specific information and the contextual information inferable from other modalities.
  • Algebraic constraints in a learned low-rank shared subspace make it possible to identify exactly which information is lost (i.e., what cannot be reconstructed) when a modality is missing.
  • For the truly modality-specific information that cannot be inferred when a modality is absent, representations are generated by a latent diffusion model conditioned on the recovered shared information, incorporating structural priors.
  • On five TCGA cancer datasets, MUST achieves state-of-the-art-level survival prediction with complete data while remaining robust under both missing-pathology and missing-genomics conditions, with clinically acceptable inference latency reported.
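The decomposition idea above can be sketched in a few lines: project each modality embedding onto a low-rank shared subspace and treat the orthogonal residual as the modality-specific part. This is an illustrative sketch only; the orthonormal basis `U` here is a fixed random stand-in for the subspace the paper learns, and `decompose` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # embedding dimension, shared-subspace rank (illustrative values)

# Orthonormal basis for the low-rank shared subspace.
# In MUST this subspace is learned; here it is a fixed random stand-in.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))

def decompose(h):
    """Split h into a shared component (in span(U)) and a modality-specific residual."""
    shared = U @ (U.T @ h)   # projection onto the shared subspace
    specific = h - shared    # orthogonal residual: what other modalities cannot recover
    return shared, specific

h_path = rng.standard_normal(d)  # e.g. a pathology embedding
shared, specific = decompose(h_path)

# Algebraic constraints: the two parts reconstruct h exactly,
# and the residual is orthogonal to the shared subspace.
assert np.allclose(shared + specific, h_path)
assert np.abs(U.T @ specific).max() < 1e-10
```

When a modality is missing, only its `shared` component can (in principle) be recovered from the other modalities; the residual `specific` is precisely the information that must be generated, which is where the diffusion model comes in.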

Abstract

Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose MUST (Modality-Specific representation-aware Transformer), a novel framework that explicitly decomposes each modality's representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables precise identification of what information is lost when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that MUST achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.
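The generation step described in the abstract, sampling a missing modality-specific representation conditioned on recovered shared information, can be sketched as a standard DDPM-style reverse process. Everything here is an assumption for illustration: `denoise` is a toy linear stand-in for the paper's trained conditional noise predictor, and the schedule values are generic defaults, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 16, 50                          # latent dimension, diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, T)     # generic linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoise(z, t, cond):
    # Toy epsilon-predictor: in MUST this would be a trained network
    # conditioned on the recovered shared representation `cond`.
    return 0.1 * z + 0.05 * cond

def sample(cond):
    """Generate a modality-specific latent conditioned on shared information."""
    z = rng.standard_normal(d)         # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoise(z, t, cond)
        # DDPM posterior mean update
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            z += np.sqrt(betas[t]) * rng.standard_normal(d)  # stochastic step
    return z

shared = rng.standard_normal(d)        # recovered shared information for the missing modality
specific_hat = sample(shared)          # generated modality-specific representation
print(specific_hat.shape)              # prints (16,)
```

The design point the abstract emphasizes is the conditioning: the sampler does not hallucinate the whole missing modality from noise alone, but is steered by the shared information that the algebraic decomposition shows is recoverable from the available modalities.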