FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

arXiv cs.AI / 4/27/2026

Key Points

  • The FeatEHR-LLM framework uses large language models (LLMs) to automatically generate clinically meaningful tabular features from irregularly sampled electronic health record (EHR) time series.
  • It addresses EHR-specific challenges such as irregular observation intervals, varying measurement frequencies, and structural sparsity by using tool-augmented mechanisms that query temporal data and produce feature-extraction code that handles uneven patterns.
  • To limit patient privacy exposure, the LLM generates features using only dataset schemas and task descriptions rather than accessing raw patient records.
  • The system supports both univariate and multivariate feature generation via an iterative, validation-in-the-loop pipeline.
  • Across eight clinical prediction tasks on four ICU datasets, FeatEHR-LLM achieved the best mean AUROC on 7 of 8 tasks, improving results by up to 6 percentage points over strong baselines.
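
To make the "irregular observation intervals" and "informative sparsity" points concrete, here is a minimal sketch of the kind of tabular features one might hand-write for a single irregularly sampled variable. It is not the FeatEHR-LLM code: the `Observation` type, the `irregular_features` function, the feature names, and the choice of hours as the time unit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    hours: float  # time since admission in hours (assumed unit)
    value: float  # measured value, e.g. a lab result


def irregular_features(obs: list[Observation], window_end: float) -> dict[str, Optional[float]]:
    """Summarize one irregularly sampled variable using data up to window_end."""
    seen = sorted((o for o in obs if o.hours <= window_end), key=lambda o: o.hours)
    if not seen:
        # A variable that was never measured is itself informative,
        # so report a zero count instead of imputing values.
        return {"count": 0.0, "last_value": None, "hours_since_last": None,
                "time_weighted_mean": None, "slope_per_hour": None}

    first, last = seen[0], seen[-1]
    feats: dict[str, Optional[float]] = {
        "count": float(len(seen)),                   # measurement frequency
        "last_value": last.value,
        "hours_since_last": window_end - last.hours, # staleness of the last value
    }

    # Time-weighted mean: hold each value until the next observation
    # (or the window end), so densely sampled periods do not dominate.
    boundaries = [o.hours for o in seen[1:]] + [window_end]
    total = sum(o.value * (end - o.hours) for o, end in zip(seen, boundaries))
    span = window_end - first.hours
    feats["time_weighted_mean"] = total / span if span > 0 else last.value

    # Slope uses actual elapsed time, not the observation index.
    dt = last.hours - first.hours
    feats["slope_per_hour"] = (last.value - first.value) / dt if dt > 0 else None
    return feats


example = irregular_features(
    [Observation(0.0, 2.0), Observation(1.0, 4.0), Observation(4.0, 1.0)],
    window_end=6.0,
)
```

Each feature here uses the actual timestamps rather than assuming a fixed sampling grid, which is the property the key points attribute to the generated feature-extraction code.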

Abstract

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
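
The abstract does not spell out how the iterative, validation-in-the-loop pipeline decides which features to keep. One plausible reading is a greedy loop that keeps a candidate feature only if it improves a validation metric. The sketch below illustrates that idea under loud assumptions: the names `auroc` and `select_features` are hypothetical, the candidates come from a fixed dictionary rather than an LLM, and summing the kept columns is a deliberately crude stand-in for fitting a real model.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_features(
    candidates: dict[str, list[float]],  # feature name -> validation-set column
    labels: list[int],
    min_gain: float = 0.01,
) -> tuple[list[str], float]:
    """Greedily keep a candidate only if it raises validation AUROC by min_gain."""
    kept: list[str] = []
    best = 0.5  # AUROC of an uninformative score, used as the starting point
    for name, column in candidates.items():
        trial = kept + [name]
        # Assumption: the "model" is just the sum of the kept feature columns.
        risk = [sum(candidates[f][i] for f in trial) for i in range(len(column))]
        score = auroc(risk, labels)
        if score > best + min_gain:
            kept, best = trial, score
    return kept, best


labels = [0, 0, 1, 1]
candidates = {
    "a": [0.1, 0.4, 0.35, 0.8],
    "noise": [0.5, 0.1, 0.2, 0.3],
    "b": [0.0, 0.0, 0.3, 0.0],
}
kept, best = select_features(candidates, labels)
```

In this toy run, `a` is accepted, `noise` is rejected because it does not improve the validation score, and `b` is accepted because it corrects the one misranked pair. A real pipeline would presumably feed such validation results back to the feature generator for the next iteration.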