Tokenization Tradeoffs in Structured EHR Foundation Models
arXiv cs.LG / 3/18/2026
Key Points
- The authors pretrained a transformer on pediatric EHR data, varying tokenization in a factorial design along three axes: event encoding, time encoding, and workflow annotation.
- Joint event encoding and positional time encoding outperformed their alternatives on 73 of 74 and 71 of 74 tasks, respectively, while requiring 39.5% and 9.6% fewer pretraining FLOPs.
- The advantage is attributed to local binding efficiency, with code-attribute pairs combined into single tokens instead of being split across tokens that the model must learn to associate.
- External evaluation on an adult ICU cohort shows the gains generalize despite vocabulary mismatch, though temporal and workflow effects appear institution-specific.
- The findings position tokenization as a practical lever to improve both performance and efficiency in EHR foundation models.
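
The contrast between joint and split event encoding can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `(code, attribute)` event format, the `|` separator, the `ATTR:` prefix, and the function names `tokenize_joint` and `tokenize_split` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical event format: (clinical code, attribute). Not the paper's schema.
Event = Tuple[str, str]


def tokenize_joint(events: List[Event]) -> List[str]:
    """Joint encoding: each code-attribute pair becomes one token."""
    return [f"{code}|{attr}" for code, attr in events]


def tokenize_split(events: List[Event]) -> List[str]:
    """Split encoding: code and attribute are emitted as separate tokens,
    so the model has to learn which attribute belongs to which code."""
    tokens: List[str] = []
    for code, attr in events:
        tokens.append(code)
        tokens.append(f"ATTR:{attr}")
    return tokens


# Made-up example events, for illustration only.
events: List[Event] = [
    ("LAB:glucose", "high"),
    ("LAB:sodium", "normal"),
    ("MED:insulin", "given"),
]

joint = tokenize_joint(events)
split = tokenize_split(events)

print(joint)  # one token per event
print(split)  # two tokens per event
```

In this toy setup the joint scheme emits one token per event and the split scheme emits two, so the joint sequence is shorter; the cost is a larger vocabulary, since every observed code-attribute combination needs its own entry.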



