Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models

arXiv cs.LG / 4/21/2026


Key Points

  • The paper proposes a fixed-budget, one-epoch pretraining benchmark to isolate how input representation choices affect downstream performance in generative medical event models.
  • Using 28 matched transformer models trained on MIMIC-IV and evaluated across 30 clinical outcomes, the study systematically tests representation variants such as quantization granularity, reference-range anchoring, and code-value fusion.
  • Code-value fused tokenization significantly improves mortality AUROC (0.891→0.915, BH-adjusted p < 0.001) and hospital length-of-stay AUROC (0.763→0.788, BH-adjusted p < 0.001), and raises mean regression Spearman rho across 13 regression outcomes from 0.414 to 0.494.
  • For temporal encoding, simple event-order and admission-relative RoPE approaches match or outperform time tokens on average while reducing sequence length by 11%.
  • CLIF remapping for lab/vital codes preserves downstream performance in the authors’ single-site setting and produces a smaller, more clinically interpretable token set intended to support multi-site compatibility.
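The fused-vs-unfused distinction in the first key point can be made concrete with a minimal tokenization sketch. The decile edges, code names (`LAB_CREATININE`), and token formats (`VAL_Q*`, `CODE|Q*`) below are illustrative assumptions, not the paper's actual vocabulary; the idea is that an unfused scheme emits a code token followed by a shared value-bin token, while a fused scheme emits one token binding the code to its bin.

```python
import bisect

# Hypothetical decile edges for one lab code (e.g. serum creatinine, mg/dL).
# In practice these would be estimated per code from the training distribution.
DECILE_EDGES = {
    "LAB_CREATININE": [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.3, 1.6, 2.2],
}

def quantize(code: str, value: float) -> int:
    """Map a numeric value to its decile bin (0-9) for the given code."""
    return bisect.bisect_right(DECILE_EDGES[code], value)

def unfused_tokens(code: str, value: float) -> list[str]:
    """Unfused scheme: a code token followed by a shared value-bin token."""
    return [code, f"VAL_Q{quantize(code, value)}"]

def fused_token(code: str, value: float) -> list[str]:
    """Fused scheme: a single token binding the code to its value bin."""
    return [f"{code}|Q{quantize(code, value)}"]

print(unfused_tokens("LAB_CREATININE", 1.4))  # ['LAB_CREATININE', 'VAL_Q7']
print(fused_token("LAB_CREATININE", 1.4))     # ['LAB_CREATININE|Q7']
```

Note the trade-off the paper's Experiment 1 probes: fusion gives each (code, bin) pair its own embedding at the cost of a larger vocabulary, while the unfused scheme shares value-bin embeddings across all codes.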

Abstract

Every prediction from a generative medical event model is bounded by how clinical events are tokenized, yet input representation is rarely isolated from other system and architectural choices. We evaluate how representation decisions affect downstream prediction after a shared one-epoch pretraining budget. We train 28 matched transformers on MIMIC-IV and evaluate them on 30 clinical outcomes in three experiments: (1) quantization granularity, reference-range anchoring, and code-value fusion; (2) value encoding (hard bins, soft discretization, code-normalized xVal) crossed with temporal encoding (event order, time tokens, admission-relative RoPE); and (3) native MIMIC laboratory/vital codes versus the Common Longitudinal ICU Format (CLIF)-remapped laboratory/vital codes with compression-preserving perturbation arms. In Experiment 1, fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 (BH-adjusted p < 0.001), hospital length-of-stay AUROC from 0.763 to 0.788 (BH-adjusted p < 0.001), and, for the decile fused-vs-unfused comparison, mean regression Spearman rho across the 13 regression outcomes from 0.414 to 0.494. Across the three temporal encodings, event order only and admission-relative RoPE match or exceed inserting time tokens on average while shortening sequences by 11%. CLIF remapping preserves downstream performance in our single-site setting while yielding a smaller, clinically interpretable token set compatible with multi-site use. Finer-than-decile quantization, reference-range anchoring, and soft discretization help in selective outcomes, while code-normalized xVal remains well below the discrete and soft families, consistent with near-median suppression that persists even with the affine variant.