Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization

arXiv cs.LG / 5/4/2026

📰 News · Models & Research

Key Points

  • The paper introduces a new self-supervised learning framework for medical time series (e.g., ECG/EEG) that compresses variable-length signals into a fixed set of k latent “Fingerprint Tokens.”
  • It uses a cross-attention bottleneck to produce the tokens and trains them with two objectives: a reconstruction loss for sufficiency and a redundancy-reducing diversity penalty based on the Total Coding Rate (TCR) for disentanglement.
  • The authors provide theoretical justification by formulating the problem as a “Disentangled Rate-Distortion” objective that balances information retention against token independence; one plausible form of this objective is sketched after this list.
  • The method is intended to yield low-dimensional, interpretable, and sample-efficient representations that can support more robust digital biomarkers.
  • Compared with common pretraining methods such as Masked Autoencoders, the framework targets more compact and semantically interpretable latents instead of relying on heuristic aggregation such as global average pooling or a designated [CLS] token.
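To make the dual objective concrete, here is one plausible formulation, assuming the Total Coding Rate takes the log-det form common in the coding-rate literature (e.g., MCR²/EMP-SSL); the trade-off weight λ and distortion parameter ε are illustrative hyperparameters, not values from the paper:

```latex
% One plausible form of the Disentangled Rate-Distortion objective
% (lambda and epsilon are assumed hyperparameters, not from the paper):
\min_{\theta}\;
\underbrace{\mathbb{E}\big[\lVert x - \hat{x} \rVert_2^2\big]}_{\text{reconstruction (sufficiency)}}
\;-\;
\lambda\,
\underbrace{\tfrac{1}{2}\log\det\!\Big(I_d + \tfrac{d}{k\,\varepsilon^2}\, Z Z^{\top}\Big)}_{\text{Total Coding Rate (diversity)}}
```

Here Z ∈ ℝ^{d×k} stacks the k token embeddings as columns; maximizing the log-det term expands the volume the tokens span, which penalizes redundancy between them.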

Abstract

Learning meaningful representations from medical time series (MedTS) such as ECG or EEG signals is a critical challenge. These signals are often high-dimensional, variable-length, and rife with noise. Existing self-supervised approaches, such as Masked Autoencoders (MAEs), are highly effective for pre-training general-purpose encoders. However, they do not explicitly learn compact and semantically interpretable latent representations, typically relying on heuristic aggregation strategies such as global average pooling or a designated [CLS] token. We propose a novel framework that compresses a variable-length MedTS into a fixed-size set of k latent Fingerprint Tokens. Our architecture employs a cross-attention bottleneck to generate these tokens and is trained with a dual-objective function. The first objective is a reconstruction loss, which ensures the tokens are sufficient statistics for the original data. The second, a diversity penalty based on the Total Coding Rate (TCR), explicitly minimizes the redundancy between tokens, encouraging them to become statistically disentangled representations. We present the theoretical justification for our method, framing it as a novel Disentangled Rate-Distortion problem. This approach produces a low-dimensional, interpretable, and sample-efficient representation, where each token is encouraged to capture an independent factor of variation, paving the way for more robust digital biomarkers.
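As a minimal sketch of how such a pipeline might fit together, the code below assumes a Perceiver-style cross-attention bottleneck with learned latent queries and the log-det form of the Total Coding Rate. All names and hyperparameters (FingerprintBottleneck, k, dim, eps, lam) are hypothetical stand-ins, not the authors' implementation:

```python
# Minimal sketch, assuming a Perceiver-style cross-attention bottleneck and the
# log-det Total Coding Rate; names and hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def total_coding_rate(z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """TCR(Z) = 1/2 * logdet(I + d/(k * eps^2) * Z Z^T) for Z of shape (d, k)."""
    d, k = z.shape
    identity = torch.eye(d, device=z.device)
    return 0.5 * torch.logdet(identity + (d / (k * eps**2)) * (z @ z.T))


class FingerprintBottleneck(nn.Module):
    """Compress a variable-length sequence into k latent tokens via cross-attention."""

    def __init__(self, k: int = 8, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(k, dim))  # learned latent queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> fingerprint tokens: (batch, k, dim)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        tokens, _ = self.attn(query=q, key=x, value=x)
        return tokens


# Toy training step: a random embedding stands in for a real ECG/EEG encoder output.
batch, seq_len, dim, k, lam = 4, 250, 64, 8, 0.1
signal_emb = torch.randn(batch, seq_len, dim)

bottleneck = FingerprintBottleneck(k=k, dim=dim)
decoder = nn.MultiheadAttention(dim, 4, batch_first=True)  # stand-in decoder
pos_queries = torch.randn(batch, seq_len, dim)             # e.g., positional encodings

tokens = bottleneck(signal_emb)                            # (batch, k, dim)
recon, _ = decoder(pos_queries, tokens, tokens)            # decode back to seq_len steps
recon_loss = F.mse_loss(recon, signal_emb)                 # sufficiency term

# Diversity term: average per-sample TCR over the batch (tokens as columns of Z).
tcr = torch.stack([total_coding_rate(t.T) for t in tokens]).mean()
loss = recon_loss - lam * tcr  # minimize reconstruction error, maximize coding rate
loss.backward()
```

Note the sign convention: the TCR term is subtracted, so minimizing the loss simultaneously drives reconstruction error down (sufficiency) and the coding rate up (token diversity), matching the two objectives described in the abstract.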