LLM-Extracted Covariates for Clinical Causal Inference: Rethinking Integration Strategies

arXiv cs.LG · April 21, 2026


Key Points

  • The study addresses a key limitation of causal inference from EHRs—unmeasured confounding from clinically important states recorded in free text—and proposes using LLMs to extract those states as structured covariates.
  • Using 21,859 sepsis patients from MIMIC-IV, the authors compare seven integration strategies for estimating the effect of early vasopressor initiation on 28-day mortality, including tabular baselines, traditional NLP features, and three LLM-augmented methods.
  • The best-performing approach is to directly augment the propensity score model with LLM-derived covariates, while strategies like dual-caliper matching based on text-derived categorical distances can worsen results by shrinking the donor pool.
  • In semi-synthetic experiments with known ground-truth effects, LLM-augmented propensity scores sharply reduce bias (from 0.0143 to 0.0003) versus tabular-only methods, and the improvement remains under substantial simulated extraction error.
  • On real data, adding LLM-extracted covariates shifts the estimated treatment effect from 0.055 to 0.027, a change that aligns directionally with the CLOVERS randomized trial; a doubly robust estimate of 0.019 further supports the finding's robustness.
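
The best-performing strategy above — feeding text-derived covariates directly into the propensity score model — can be illustrated with a minimal synthetic sketch. This is not the paper's pipeline: the data, coefficients, and the `frailty` flag (standing in for an LLM-extracted confounder such as frailty parsed from notes) are all hypothetical, and a plain IPW estimator is used for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: structured (tabular) covariates plus one confounder
# that only exists in free text (here, a binary "frailty" flag an LLM
# might extract from clinical notes).
X_tab = rng.normal(size=(n, 3))
frailty = rng.binomial(1, 0.3, size=n)  # latent confounder
p_treat = 1 / (1 + np.exp(-(0.5 * X_tab[:, 0] + 1.2 * frailty)))
T = rng.binomial(1, p_treat)            # treatment depends on frailty
p_out = 1 / (1 + np.exp(-(-1.0 + 0.3 * T + 1.5 * frailty + 0.4 * X_tab[:, 1])))
Y = rng.binomial(1, p_out)              # outcome also depends on frailty

def ipw_ate(X):
    """Fit a propensity model on X and return the IPW estimate of the ATE."""
    ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)        # trim extreme weights
    return np.mean(T * Y / ps) - np.mean((1 - T) * Y / (1 - ps))

ate_tab = ipw_ate(X_tab)                              # tabular covariates only
ate_aug = ipw_ate(np.column_stack([X_tab, frailty]))  # + text-derived covariate
print(f"tabular-only ATE: {ate_tab:.3f}, augmented ATE: {ate_aug:.3f}")
```

Because `frailty` drives both treatment and outcome, the tabular-only propensity model leaves residual confounding and overstates the effect; adding the text-derived flag to the propensity model removes most of that bias, mirroring the paper's semi-synthetic result in miniature.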

Abstract

Causal inference from electronic health records (EHRs) is fundamentally limited by unmeasured confounding: critical clinical states such as frailty, goals of care, and mental status are documented in free-text notes but absent from structured data. Large language models can extract these latent confounders as interpretable, structured covariates, yet how to effectively integrate them into causal estimation pipelines has not been systematically studied. Using the MIMIC-IV database with 21,859 sepsis patients, we compare seven covariate-integration strategies for estimating the effect of early vasopressor initiation on 28-day mortality, spanning tabular-only baselines, traditional NLP representations, and three LLM-augmented approaches. A central finding is that not all integration strategies are equally effective: directly augmenting the propensity score model with LLM covariates achieves the best performance, while dual-caliper matching on text-derived categorical distances restricts the donor pool and degrades estimation. In semi-synthetic experiments with known ground-truth effects, LLM-augmented propensity scores reduce estimation bias from 0.0143 to 0.0003 relative to tabular-only methods, and this advantage persists under substantial simulated extraction error. On real data, incorporating LLM-extracted covariates reduces the estimated treatment effect from 0.055 to 0.027, directionally consistent with the CLOVERS randomized trial, and a doubly robust estimator, which yields 0.019, confirms the robustness of this finding. Our results offer practical guidance on when and how text-derived covariates improve causal estimation in critical care.
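
The doubly robust check mentioned at the end of the abstract can be sketched as an AIPW (augmented inverse-probability-weighted) estimator, which combines a propensity model with an outcome model and remains consistent if either one is correctly specified. The sketch below uses entirely synthetic data with made-up coefficients, not the study's cohort or models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 4000

# Synthetic confounded data: X drives both treatment assignment and outcome.
X = rng.normal(size=(n, 4))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.4 * T + 0.8 * X[:, 0]))))

# Propensity model: P(T=1 | X), clipped to avoid extreme weights.
ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.01, 0.99)

# Outcome model: P(Y=1 | X, T), evaluated under both treatment arms.
out = LogisticRegression(max_iter=1000).fit(np.column_stack([X, T]), Y)
mu1 = out.predict_proba(np.column_stack([X, np.ones(n)]))[:, 1]
mu0 = out.predict_proba(np.column_stack([X, np.zeros(n)]))[:, 1]

# AIPW: outcome-model contrast plus an inverse-propensity-weighted
# residual correction for each arm.
aipw = np.mean(mu1 - mu0
               + T * (Y - mu1) / ps
               - (1 - T) * (Y - mu0) / (1 - ps))
print(f"AIPW ATE estimate: {aipw:.3f}")
```

The residual-correction terms are what make the estimator "doubly robust": if the outcome model is wrong but the propensity model is right, the weighted residuals repair the bias, and vice versa — which is why the paper's 0.019 doubly robust estimate serves as a consistency check on the propensity-based result.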