Off-Policy Evaluation and Learning for Survival Outcomes under Censoring

arXiv stat.ML / 3/25/2026


Key Points

  • The paper addresses how to optimize and evaluate survival-related objectives (e.g., patient survival or customer retention) from logged data using Off-Policy Evaluation (OPE), avoiding risky online experiments.
  • It argues that standard OPE estimators fail for right-censored outcomes: by ignoring the unobserved survival time beyond the censoring point, they systematically underestimate the true policy performance.
  • The authors propose new censoring-aware estimators, IPCW-IPS and IPCW-DR, based on Inverse Probability of Censoring Weighting to correct for censoring bias (a minimal sketch of the IPS variant follows this list).
  • They prove unbiasedness for the proposed estimators and show that IPCW-DR is doubly robust (consistent if either the propensity-score model or the outcome model is correctly specified).
  • The framework further extends to Off-Policy Learning under budget constraints, validated via simulations and demonstrations on public real-world datasets.
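To make the core correction concrete, here is a minimal sketch of what an IPCW-corrected IPS estimate looks like. This is not the authors' code: the function name `ipcw_ips_value`, the argument layout, and the use of a plug-in censoring-survival estimate `g_hat_at_y` (e.g., from a Kaplan-Meier fit on the censoring times) are illustrative assumptions based on the standard IPCW construction the abstract describes.

```python
import numpy as np

def ipcw_ips_value(pi_e_probs, pi_b_probs, y_obs, uncensored, g_hat_at_y):
    """Sketch of an IPCW-IPS policy value estimate (illustrative, not the paper's code).

    pi_e_probs  : pi_e(a_i | x_i), target-policy probability of each logged action
    pi_b_probs  : pi_b(a_i | x_i), logging-policy probability of the same action
    y_obs       : observed times, min(survival time T_i, censoring time C_i)
    uncensored  : event indicators, 1 if T_i <= C_i, else 0
    g_hat_at_y  : estimated P(C_i > y_i | x_i, a_i), the probability of
                  remaining uncensored past the observed time
    """
    iw = pi_e_probs / pi_b_probs      # standard importance weights
    cw = uncensored / g_hat_at_y      # IPCW correction: upweight uncensored units
    return np.mean(iw * cw * y_obs)   # plain IPS would average iw * y_obs and be biased


# Toy check on synthetic data where the censoring distribution is known:
rng = np.random.default_rng(0)
n = 100_000
pi_b = np.full(n, 0.5)                # logging policy takes the logged action w.p. 0.5
pi_e = np.full(n, 0.8)                # target policy would take it w.p. 0.8
t = rng.exponential(2.0, size=n)      # latent survival times, E[T] = 2
c = rng.exponential(4.0, size=n)      # censoring times, so P(C > y) = exp(-y / 4)
y, d = np.minimum(t, c), (t <= c).astype(float)
print(ipcw_ips_value(pi_e, pi_b, y, d, np.exp(-y / 4.0)))  # ~= (0.8 / 0.5) * 2 = 3.2
```

On the same synthetic data, the naive estimate `np.mean((pi_e / pi_b) * y)` comes out well below 3.2, which is exactly the systematic underestimation the paper attributes to ignoring censoring.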

Abstract

Optimizing survival outcomes, such as patient survival or customer retention, is a critical objective in data-driven decision-making. Off-Policy Evaluation (OPE) provides a powerful framework for assessing such decision-making policies using logged data alone, without the need for costly or risky online experiments in high-stakes applications. However, typical estimators are not designed to handle right-censored survival outcomes, as they ignore unobserved survival times beyond the censoring time, leading to systematic underestimation of the true policy performance. To address this issue, we propose a novel framework for OPE and Off-Policy Learning (OPL) tailored for survival outcomes under censoring. Specifically, we introduce IPCW-IPS and IPCW-DR, which employ the Inverse Probability of Censoring Weighting technique to explicitly deal with censoring bias. We theoretically establish that our estimators are unbiased and that IPCW-DR achieves double robustness, ensuring consistency if either the propensity score or the outcome model is correct. Furthermore, we extend this framework to constrained OPL to optimize policy value under budget constraints. We demonstrate the effectiveness of our proposed methods through simulation studies and illustrate their practical impacts using public real-world data for both evaluation and learning tasks.
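The abstract's double-robustness claim is easiest to see in estimator form. The sketch below is a hedged reading of how an IPCW-DR estimator is typically assembled: an outcome-model baseline plus a censoring-corrected, importance-weighted residual. The names `mu_pi` and `mu_logged` are hypothetical placeholders for the outcome model's predictions, and the exact form is an assumption based on standard doubly robust constructions, not the paper's definition.

```python
import numpy as np

def ipcw_dr_value(pi_e_probs, pi_b_probs, mu_pi, mu_logged,
                  y_obs, uncensored, g_hat_at_y):
    """Sketch of an IPCW-DR policy value estimate (illustrative, not the paper's code).

    mu_pi      : outcome model averaged under the target policy,
                 sum over a of pi_e(a | x_i) * mu_hat(x_i, a)  (direct-method term)
    mu_logged  : mu_hat(x_i, a_i), the outcome model at the logged action
    Remaining arguments are as in the IPCW-IPS sketch above.
    """
    iw = pi_e_probs / pi_b_probs
    cw = uncensored / g_hat_at_y
    # Assuming g_hat is well estimated: if mu_hat is correct, the residual
    # (y_obs - mu_logged) term has mean zero; if the propensity weights are
    # correct, the weighted residual cancels any bias in the mu_pi baseline.
    # Either way the estimate is consistent, which is the double-robustness
    # property the abstract states for IPCW-DR.
    return np.mean(mu_pi + iw * cw * (y_obs - mu_logged))
```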