Translational Gaps in Graph Transformers for Longitudinal EHR Prediction: A Critical Appraisal of GT-BEHRT
arXiv cs.LG · March 17, 2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The article provides a critical appraisal of GT-BEHRT, a graph-transformer architecture for longitudinal EHRs, which argues that the model captures visit-level structure rather than treating encounters as unordered bags of codes.
- It assesses GT-BEHRT's performance on MIMIC-IV and All of Us for heart failure prediction, noting a reported AUROC of 94.37, AUPRC of 73.96, and F1 of 64.70, while questioning whether these gains reflect genuine architectural benefits or artifacts of evaluation design.
- The appraisal analyzes seven dimensions of modern ML systems, including representation design, pretraining strategy, cohort construction transparency, evaluation beyond discrimination, fairness, reproducibility, and deployment feasibility.
- It identifies gaps including the absence of calibration analysis, incomplete fairness evaluation, sensitivity to cohort selection, and limited analysis across phenotypes, prediction horizons, and practical deployment considerations, concluding that more rigorous evaluation is needed before clinical use.
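The calibration gap flagged above is easy to make concrete. The following sketch (illustrative only, not code or data from the paper) shows on synthetic data how a model can achieve a high AUROC yet remain poorly calibrated, which is why the appraisal argues that discrimination metrics alone are insufficient; the AUROC is computed via the Mann-Whitney statistic and calibration via a simple expected calibration error (ECE).

```python
import numpy as np

def auroc(y_true, y_prob):
    """AUROC as the probability that a random positive case is
    scored above a random negative case (Mann-Whitney U)."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: prevalence-weighted gap between mean predicted probability
    and observed event rate within equal-width probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, bins[1:-1])  # bin index 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        gap = abs(y_prob[mask].mean() - y_true[mask].mean())
        ece += mask.mean() * gap
    return ece

# Synthetic cohort with ~10% event prevalence: scores separate the
# classes well (high AUROC) but are systematically overconfident.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, size=5000)
scores = np.clip(rng.normal(0.3 + 0.5 * y, 0.15), 0.0, 1.0)
print(f"AUROC = {auroc(y, scores):.3f}, ECE = {expected_calibration_error(y, scores):.3f}")
```

In this toy setup the AUROC is near-perfect while the ECE is large, because negative cases receive mean predicted risks around 0.3 despite an observed event rate near zero in those bins. Reporting a calibration measure (or a reliability curve) alongside AUROC/AUPRC is the kind of "evaluation beyond discrimination" the appraisal calls for.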
