A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study

arXiv cs.AI · March 27, 2026


Key Points

  • The study argues that interpretability is essential for auditing deep clinical predictive models in high-stakes healthcare settings and highlights open questions about how architectural choices and explanation methods interact.
  • It introduces a comprehensive, extensible benchmark that evaluates interpretability methods across multiple clinical prediction tasks and different model architectures, aiming to improve reproducibility versus earlier benchmarking efforts.
  • The results indicate that attention, when properly leveraged, can provide faithful and computationally efficient explanations for model predictions.
  • The authors find that black-box interpretability tools such as KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks.
  • The paper also identifies several interpretability approaches as too unreliable to trust and provides practical guidelines, releasing implementations via the open-source PyHealth framework.
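The efficiency argument behind the attention finding is that attention weights fall out of a single forward pass, whereas perturbation-based explainers such as KernelSHAP and LIME require many forward passes per prediction. The toy model below is a minimal sketch of this idea, not the paper's or PyHealth's implementation; all names (`ToyAttnClassifier`, the layer sizes) are illustrative assumptions.

```python
# Illustrative sketch: a model's own attention weights double as
# per-timestep attributions for a time-series input. Not the paper's code.
import torch
import torch.nn as nn

class ToyAttnClassifier(nn.Module):
    """Scores each timestep with additive attention, pools, then classifies."""
    def __init__(self, n_features: int, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Linear(n_features, hidden)
        self.attn = nn.Linear(hidden, 1)   # one attention score per timestep
        self.head = nn.Linear(hidden, 1)   # binary logit

    def forward(self, x):                  # x: (batch, time, features)
        h = torch.tanh(self.encoder(x))
        alpha = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, time)
        pooled = (alpha.unsqueeze(-1) * h).sum(dim=1)           # weighted pool
        return self.head(pooled), alpha    # prediction + attention "explanation"

model = ToyAttnClassifier(n_features=4)
x = torch.randn(2, 10, 4)                  # 2 patients, 10 timesteps, 4 features
with torch.no_grad():
    logits, alpha = model(x)
# One forward pass yields a normalized attribution per timestep; a
# perturbation-based explainer would instead rerun the model on many
# masked or sampled variants of x for every prediction being explained.
```

Each row of `alpha` sums to 1 over timesteps, so it can be read directly as a distribution of importance across a patient's visit history; the benchmark's point is that this comes essentially for free during inference.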

Abstract

Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility and, critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention, when leveraged properly, is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines for improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.