Tokenization Tradeoffs in Structured EHR Foundation Models
arXiv cs.LG · March 18, 2026
Key Points
- The paper pretrained a transformer on pediatric EHR data while factorially varying tokenization along three axes: event encoding, time encoding, and workflow annotation.
- Joint event encoding and positional time encoding each outperformed their alternatives (winning on 73/74 and 71/74 tasks, respectively) while requiring 39.5% and 9.6% fewer pretraining FLOPs.
- The advantage is attributed to local binding efficiency: each code-attribute pair is fused into a single token, rather than split across multiple tokens whose association the model must learn through attention.
- External evaluation on an adult ICU cohort shows the event-encoding gains generalize despite vocabulary mismatch, while the temporal and workflow effects appear institution-specific.
- The findings position tokenization as a practical lever to improve both performance and efficiency in EHR foundation models.
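To make the local-binding idea concrete, here is a minimal sketch of the two event-encoding strategies contrasted above. This is not the paper's code; the token names and the `|` fusion delimiter are illustrative assumptions.

```python
# Hypothetical sketch of split vs. joint event encoding for structured
# EHR events, each event being a (code, attribute) pair such as a lab
# code with a value bin. Token names are invented for illustration.

def split_encode(events):
    """Split encoding: the code and its attribute become separate
    tokens, so the model must learn to associate them across
    positions via attention."""
    tokens = []
    for code, attr in events:
        tokens.append(code)
        tokens.append(attr)
    return tokens

def joint_encode(events):
    """Joint encoding: each (code, attribute) pair is fused into one
    token, binding them locally at the cost of a larger vocabulary."""
    return [f"{code}|{attr}" for code, attr in events]

events = [("LAB_GLUCOSE", "HIGH"), ("MED_INSULIN", "DOSE_10U")]
print(split_encode(events))  # 4 tokens: code and attribute separated
print(joint_encode(events))  # 2 tokens: one fused token per event
```

The tradeoff the paper's efficiency numbers speak to is visible even in this toy: joint encoding halves the sequence length per event (fewer tokens to process per pretraining step), while split encoding keeps the vocabulary smaller but spends model capacity re-learning the pairing.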