Cost-optimal Sequential Testing via Doubly Robust Q-learning

arXiv stat.ML / 4/14/2026


Key Points

  • The paper studies how to learn cost-optimal sequential clinical testing policies from retrospective data, where future tests may be missing depending on earlier results (informative missingness).
  • It proposes a doubly robust Q-learning framework under a sequential missing-at-random assumption, using path-specific inverse probability weights and auxiliary contrast models to handle test-trajectory heterogeneity.
  • The method constructs orthogonal pseudo-outcomes that yield unbiased policy learning if either the acquisition (missingness) model or the contrast model is correctly specified.
  • The authors provide theoretical guarantees (oracle inequalities, convergence rates, regret and misclassification bounds) for stage-wise estimators and validate improved cost-adjusted performance via simulations and a prostate cancer cohort application.
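The double-robustness property described above can be illustrated with a minimal single-stage sketch (the paper works with stage-wise, path-specific weights; here all model choices and numbers are illustrative assumptions, not the authors' implementation). An AIPW-style pseudo-outcome combines an acquisition (missingness) model with an outcome/contrast model, and remains unbiased even when the outcome model is deliberately misspecified, so long as the acquisition model is correct:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1, 1, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)   # true mean of y is 0

# Missingness depends on x (missing at random given x): informative
# missingness for any analysis that ignores x.
pi = 1.0 / (1.0 + np.exp(-x))           # acquisition (observation) probability
r = rng.binomial(1, pi)                 # r = 1: y is observed
y_obs = np.where(r == 1, y, 0.0)

# Deliberately misspecified outcome model m(x) = 0 (ignores x entirely),
# paired with the CORRECT acquisition model pi.
m = np.zeros(n)

# Doubly robust (AIPW) pseudo-outcome: unbiased for E[y] if EITHER the
# acquisition model pi OR the outcome model m is correctly specified.
pseudo = m + (r / pi) * (y_obs - m)
dr_est = pseudo.mean()

# Complete-case mean is biased here, because who gets observed depends on x.
cc_est = y_obs[r == 1].mean()

print(f"doubly robust estimate: {dr_est:.3f}")   # close to the truth, 0
print(f"complete-case estimate: {cc_est:.3f}")   # noticeably biased upward
```

The paper's stage-wise estimators apply the same orthogonality idea recursively along test trajectories, with path-specific weights replacing the single `r / pi` factor above.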

Abstract

Clinical decision-making often involves selecting tests that are costly, invasive, or time-consuming, motivating individualized, sequential strategies for what to measure and when to stop ascertaining. We study the problem of learning cost-optimal sequential decision policies from retrospective data, where test availability depends on prior results, inducing informative missingness. Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework for estimating optimal policies. The method introduces path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a normalization property conditional on the observed history. By combining these weights with auxiliary contrast models, we construct orthogonal pseudo-outcomes that enable unbiased policy learning when either the acquisition model or the contrast model is correctly specified. We establish oracle inequalities for the stage-wise contrast estimators, along with convergence rates, regret bounds, and misclassification rates for the learned policy. Simulations demonstrate improved cost-adjusted performance over weighted and complete-case baselines, and an application to a prostate cancer cohort study illustrates how the method reduces testing cost without compromising predictive accuracy.
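The cost-optimal stopping logic in the abstract (test further versus act on current information) can be sketched as a toy backward-induction step. Everything here is a hypothetical illustration, not the paper's estimator: the payoffs, the per-test cost, and the binary biomarker are assumptions chosen to show how a testing cost enters the stage-wise value comparison.

```python
# Toy one-step lookahead: should we pay for a test that reveals a binary
# biomarker, or stop and treat on the prior alone? All numbers are
# hypothetical payoffs chosen for illustration.

COST = 0.1        # assumed per-test acquisition cost
P_POS = 0.3       # assumed prior probability the biomarker is positive

# Stage-2 values once the result is known: utility of the best treatment
# given the revealed biomarker (hypothetical).
V_IF_POS = 1.0
V_IF_NEG = 0.4

def value_stop() -> float:
    """Treat without testing: pick the single best action under the prior."""
    treat_aggressive = P_POS * 1.0 + (1 - P_POS) * 0.1
    treat_conservative = P_POS * 0.2 + (1 - P_POS) * 0.4
    return max(treat_aggressive, treat_conservative)

def value_test() -> float:
    """Expected post-test value, net of the acquisition cost."""
    return P_POS * V_IF_POS + (1 - P_POS) * V_IF_NEG - COST

best = max(value_stop(), value_test())
policy = "test" if value_test() > value_stop() else "stop"
print(f"stop: {value_stop():.2f}, test: {value_test():.2f} -> {policy}")
```

With these numbers, testing is worth its cost (0.48 versus 0.37), so the policy acquires the test; raising `COST` above the value of the information it buys flips the decision to "stop". The paper learns such stage-wise comparisons from retrospective data via the doubly robust Q-functions, rather than from known payoffs as in this sketch.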