Hierarchical Contrastive Learning for Multimodal Data

arXiv stat.ML / 4/8/2026


Key Points

  • The paper argues that standard multimodal “shared vs private” representation learning is too simplistic because many latent factors are shared only across subsets of modalities rather than all of them.
  • It introduces Hierarchical Contrastive Learning (HCL), which learns a unified set of representations capturing globally shared, partially shared, and modality-specific factors using a hierarchical latent-variable formulation plus structural sparsity.
  • HCL uses a structure-aware contrastive objective that aligns only modality pairs that genuinely share a latent factor, aiming to avoid over-alignment of unrelated signals.
  • Assuming uncorrelated latent variables, the authors provide identifiability and recovery guarantees, along with parameter-estimation and excess-risk bounds for downstream prediction.
  • Experiments on simulations and on multimodal electronic health records show that HCL recovers the hierarchical structure more accurately and improves downstream predictive performance by producing more informative representations.
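To make the structure-aware contrastive objective concrete, here is a minimal NumPy sketch, not the authors' implementation: an InfoNCE-style loss is computed only over modality pairs flagged (by a hypothetical `share_mask`) as genuinely sharing a latent factor, so unrelated modalities are never pushed into alignment.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric-in-batch InfoNCE between two (n, d) embedding batches.

    Rows are L2-normalised; positives sit on the diagonal of the
    similarity matrix, all other rows in the batch act as negatives.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                      # (n, n) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def structure_aware_loss(embeddings, share_mask, temperature=0.1):
    """Average InfoNCE over modality pairs that share a latent factor.

    embeddings: list of (n, d) arrays, one per modality
    share_mask: (M, M) boolean array; share_mask[i, j] is True iff
                modalities i and j share at least one latent factor
                (here assumed known; in HCL it comes from the learned
                hierarchical structure)
    """
    total, n_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if share_mask[i, j]:
                total += info_nce(embeddings[i], embeddings[j], temperature)
                n_pairs += 1
    return total / max(n_pairs, 1)
```

For example, with three modalities where only the first two share a factor, `share_mask[0, 1] = share_mask[1, 0] = True` restricts the alignment pressure to that single pair.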

Abstract

Multimodal representation learning is commonly built on a shared-private decomposition, treating latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only subsets of modalities, and ignoring such partial sharing can over-align unrelated signals and obscure complementary information. We propose Hierarchical Contrastive Learning (HCL), a framework that learns globally shared, partially shared, and modality-specific representations within a unified model. HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only modalities that genuinely share a latent factor. Under uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter estimation and excess-risk bounds for downstream prediction. Simulations show accurate recovery of hierarchical structure and effective selection of task-relevant components. On multimodal electronic health records, HCL yields more informative representations and consistently improves predictive performance.
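The hierarchical latent-variable formulation described above can be illustrated with a toy generative sketch. All dimensions, variable names, and the specific sharing pattern below are illustrative assumptions, not the paper's model: each modality is a linear mixture of the latent blocks it participates in, and structural sparsity means a modality simply has zero loadings on blocks it does not share.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10                    # samples, observed dimension per modality

# Hypothetical latent blocks (2-d each):
z_global  = rng.normal(size=(n, 2))                       # shared by all modalities
z_partial = rng.normal(size=(n, 2))                       # shared by modalities 1 and 2 only
z_private = [rng.normal(size=(n, 2)) for _ in range(3)]   # one private block per modality

def A():
    """A fresh random loading matrix mapping a 2-d latent block to d features."""
    return rng.normal(size=(2, d))

def noise():
    return 0.1 * rng.normal(size=(n, d))

# Structural sparsity: each modality loads only on the blocks it shares.
x1 = z_global @ A() + z_partial @ A() + z_private[0] @ A() + noise()
x2 = z_global @ A() + z_partial @ A() + z_private[1] @ A() + noise()
x3 = z_global @ A()                   + z_private[2] @ A() + noise()  # no partial block
```

Under this structure, aligning `x1` with `x3` can only exploit the global block; forcing alignment on anything more would entangle `x3` with the partial factor it never observes, which is the over-alignment failure the abstract describes.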