Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning

arXiv cs.LG / 5/5/2026


Key Points

  • The paper argues that self-supervised biosignal learning often ignores the directional temporal dynamics between signals from different body locations, even though they reflect a shared physiological process.
  • It introduces xMAE, a biosignal pretraining framework that performs masked cross-modal reconstruction while enforcing training constraints based on temporally ordered signals (e.g., ECG preceding PPG).
  • Experiments show that representations pretrained with xMAE outperform unimodal and multimodal baselines on 15 out of 19 downstream tasks, spanning cardiovascular outcomes, abnormal lab detection, sleep staging, and demographic inference.
  • The method also generalizes across devices, body locations, and acquisition settings, and analyses indicate that learned PPG representations capture ECG–PPG timing structure.
  • The authors conclude that incorporating temporal structure into multimodal pretraining is effective when modalities correspond to different stages of the same underlying process, and they provide code on GitHub.

Abstract

Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process. However, most existing self-supervised learning methods treat these signals as interchangeable views, overlooking the directional temporal dynamics that link them. A canonical example is the relationship between electrocardiography (ECG), which captures the electrical activation initiating each heartbeat, and photoplethysmography (PPG), which records the resulting peripheral pulse delayed by vascular dynamics. To capture this structured relationship, we introduce xMAE, a biosignal pretraining framework that leverages masked cross-modal reconstruction across temporally ordered biosignals as a training-time constraint to encourage physiologically meaningful timing structure in the learned representations. We show that pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis suggests that the ECG–PPG timing structure is reflected in the learned PPG representations. More broadly, xMAE demonstrates the effectiveness of incorporating temporal structure into multimodal pretraining when signals observe different stages of a shared underlying process. Code is available at https://github.com/hzhou3/xMAE.
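The paper's actual architecture and loss are not detailed here, but the core setup — mask patches of the later modality (PPG) and reconstruct them from the temporally earlier one (ECG) — can be sketched in a toy numpy example. Everything below is illustrative: the synthetic signals, the fixed ECG→PPG lag, and the least-squares "model" standing in for a learned encoder are assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired signals: ECG (electrical activation) and PPG, modeled as
# the same waveform delayed by vascular transit time plus sensor noise.
T, patch = 256, 16                       # total samples, patch length
delay = 8                                # illustrative ECG -> PPG lag
ecg = np.sin(np.linspace(0, 8 * np.pi, T))
ppg = 0.8 * np.roll(ecg, delay) + 0.05 * rng.normal(size=T)

# Split both signals into non-overlapping patches (MAE-style tokens).
ecg_p = ecg.reshape(-1, patch)           # shape (16, 16)
ppg_p = ppg.reshape(-1, patch)

# Mask half of the PPG patches; the objective is to reconstruct the
# masked patches from the *other* modality (cross-modal masking).
n = ppg_p.shape[0]
mask = np.zeros(n, dtype=bool)
mask[rng.choice(n, n // 2, replace=False)] = True

# Stand-in "model": a least-squares linear map from ECG patches to PPG
# patches, fit only where PPG is visible. A real system would train an
# encoder/decoder; this just shows the shape of the objective.
W, *_ = np.linalg.lstsq(ecg_p[~mask], ppg_p[~mask], rcond=None)
recon = ecg_p[mask] @ W

# Masked-reconstruction loss: MSE computed on masked patches only.
loss = np.mean((recon - ppg_p[mask]) ** 2)
print(f"masked cross-modal reconstruction MSE: {loss:.4f}")
```

Because the toy PPG is (up to noise) a linear function of the delayed ECG, even this linear stand-in recovers the masked patches reasonably well; the point is the objective's structure — predict one modality's hidden content from the temporally ordered other modality — rather than the model used.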