A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency

arXiv cs.LG / 4/27/2026


Key Points

  • The paper studies how changing model size affects downstream clinical risk prediction tasks using structured Japanese claims data, where scaling benefits are not guaranteed to be monotonic.
  • Researchers pretrain encoder-only Transformer foundation models at five parameter scales (2.2M–101M) on a nationwide dataset (2.3M patients from 32 hospitals) for disease incidence and medication prediction; a rough parameter-count sketch follows this list.
  • Downstream performance shows task-dependent saturation: disease prediction improves with larger models (32M–101M), while medication prediction saturates at 11M parameters, cutting pretraining time by about 178 hours.
  • For all evaluated tasks, the best foundation model outperforms a Light Gradient Boosting Machine (LightGBM) baseline in precision-recall AUC, supporting the foundation-model approach for structured healthcare records.
  • The results provide actionable guidance for selecting an “optimal” model size that balances predictive accuracy and computational cost based on the specific task characteristics.
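
The summary reports only total parameter counts, not the underlying architecture hyperparameters. The sketch below is a back-of-the-envelope parameter counter for a generic BERT-style encoder; the vocabulary size, sequence length, widths, and depths are all illustrative assumptions chosen so the totals land near the paper's smallest and largest scales, not the authors' actual configurations.

```python
# Rough parameter count for a generic BERT-style encoder-only Transformer.
# All hyperparameters are illustrative assumptions: the summary gives only
# total scales (2.2M-101M), not vocabulary size, depth, or width.

def encoder_params(vocab: int, max_len: int, hidden: int, layers: int,
                   ffn_mult: int = 4) -> int:
    emb = (vocab + max_len) * hidden            # token + position embeddings
    attn = 4 * hidden * hidden + 4 * hidden     # Q, K, V, O weights + biases
    ffn = 2 * hidden * (ffn_mult * hidden) + hidden + ffn_mult * hidden
    norms = 4 * hidden                          # two LayerNorms (gain + bias)
    return emb + layers * (attn + ffn + norms)

# Hypothetical configs, assuming a ~10k-code claims vocabulary and
# 512-token event sequences; totals roughly bracket the reported scales.
for hidden, layers in [(128, 2), (384, 6), (512, 8), (768, 12)]:
    n = encoder_params(vocab=10_000, max_len=512, hidden=hidden, layers=layers)
    print(f"hidden={hidden}, layers={layers}: ~{n / 1e6:.1f}M params")
```

One thing such a counter makes visible: at the smallest scales the embedding table dominates the total, which is one reason scaling behavior on limited-vocabulary claims data can differ from natural language.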

Abstract

Clinical risk prediction using longitudinal medical data supports individualized care. Self-supervised foundation models have emerged as a promising approach for leveraging large-scale unlabeled healthcare records. In natural language processing, scaling laws suggest that larger models achieve predictably lower pretraining losses, supporting the foundation-model paradigm. However, for structured medical data, which is characterized by a limited vocabulary and sparse observations, it remains unclear whether increasing model size consistently improves downstream predictions, as most studies evaluate only a single model scale. In this study, we evaluated the relationship between model scale and downstream task performance for structured medical foundation models. Using a random sample (2.3 million patients, 32 hospitals) from a nationwide 519-hospital Japanese claims database, we pretrained encoder-only Transformers at five scales (2.2M–101M parameters) for disease incidence and medication prediction. Downstream performance saturated at task-dependent thresholds: disease prediction benefited from larger models (32M–101M), whereas medication prediction saturated at 11M parameters, reducing pretraining time by about 178 hours. Across all tasks, the best-performing model consistently outperformed a Light Gradient Boosting Machine (LightGBM) baseline in the area under the precision-recall curve. These findings indicate that, unlike pretraining loss, which decreased monotonically with scale, the optimal model size varied with task characteristics. This task-dependent saturation provides practical guidance for balancing predictive performance and computational cost in structured medical foundation models.
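
A minimal sketch of the evaluation logic described above, assuming scikit-learn and LightGBM are installed: PR-AUC is computed with average_precision_score, a LightGBM baseline is fit on synthetic imbalanced features as a stand-in for engineered claims features, and a simple rule picks the smallest pretrained scale whose PR-AUC sits within a tolerance of the best. The tolerance, synthetic data, and per-scale scores are all placeholder assumptions, not the paper's reported numbers.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# LightGBM baseline on synthetic, imbalanced tabular data (a stand-in for
# engineered claims features; the actual study uses the nationwide database).
X, y = make_classification(n_samples=2_000, n_features=50,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
lgbm = LGBMClassifier(random_state=0).fit(X_tr, y_tr)
baseline_pr_auc = average_precision_score(y_te, lgbm.predict_proba(X_te)[:, 1])

def select_scale(pr_auc_by_scale: dict, tol: float = 0.005) -> float:
    """Smallest model scale (in M params) within `tol` of the best PR-AUC."""
    best = max(pr_auc_by_scale.values())
    return min(s for s, auc in pr_auc_by_scale.items() if auc >= best - tol)

# Placeholder per-scale scores (NOT the paper's values), shaped to mimic the
# reported pattern: disease prediction keeps improving, medication saturates.
disease = {2.2: 0.61, 11: 0.64, 32: 0.66, 101: 0.665}
medication = {2.2: 0.70, 11: 0.74, 32: 0.741, 101: 0.742}
print(select_scale(disease))     # -> 32 (larger models still pay off)
print(select_scale(medication))  # -> 11 (early saturation)
```

With scores shaped like these, the rule reproduces the paper's qualitative conclusion: the medication task can stop at 11M parameters, which is where the reported saving of about 178 pretraining hours comes from.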