CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance

arXiv cs.CL / 4/3/2026


Key Points

  • The paper studies how LLM systems degrade in high-stakes settings when evidence is internally inconsistent, using healthcare cases where patient symptoms contradict medical signs.
  • It introduces MIMIC-DOS, a new dataset for short-horizon prediction of organ dysfunction worsening in the ICU, derived from MIMIC-IV and curated specifically for sign–symptom discordance.
  • The authors propose CARE, a multi-stage privacy-compliant agentic reasoning framework that separates roles: a remote LLM generates structured reasoning scaffolds without seeing sensitive patient data, while a local LLM uses them for evidence acquisition and final decisions.
  • Experiments indicate CARE outperforms several baseline approaches (including single-pass LLMs and other agentic pipelines) across key metrics, showing improved robustness to conflicting clinical evidence while maintaining privacy constraints.

Abstract

Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. This scenario arises in real-world healthcare settings, where patient-reported symptoms contradict objective medical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon prediction of organ dysfunction worsening in the intensive care unit (ICU). We derive this dataset from MIMIC-IV, a widely used, publicly available electronic health record dataset, and construct it exclusively from cases that exhibit discordance between signs and symptoms. This setting poses a substantial challenge for existing LLM-based approaches: single-pass LLMs and agentic pipelines alike often struggle to reconcile such conflicting signals. To address this problem, we propose CARE, a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics than multiple baseline settings, showing that it can handle conflicting clinical evidence more robustly while preserving privacy.
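The privacy split at the heart of CARE can be illustrated with a minimal sketch. Everything below is hypothetical: the function names, the scaffold structure, and the toy decision rule are illustrative stand-ins, not the paper's actual prompts or interfaces. The point is the data-flow contract: the remote model receives only a de-identified task description, while the patient record never leaves the local component.

```python
# Minimal sketch of CARE's remote/local separation (hypothetical names;
# the real framework's prompts and model interfaces are not specified here).

def remote_scaffold(task_description: str) -> dict:
    """Stand-in for the remote LLM: returns structured reasoning
    categories and transitions WITHOUT seeing any patient record.
    In a real system this would be an API call to a frontier model."""
    return {
        "categories": ["vital_signs", "labs", "patient_reported_symptoms"],
        "transitions": [
            ("patient_reported_symptoms", "labs"),
            ("labs", "vital_signs"),
        ],
    }

def local_decide(patient_record: dict, scaffold: dict) -> str:
    """Stand-in for the local LLM: acquires evidence following the
    scaffold and makes the final call. Toy rule for the discordant
    case: objective signs override self-reported symptoms."""
    evidence = {c: patient_record.get(c) for c in scaffold["categories"]}
    signs_worsening = evidence["vital_signs"] == "deteriorating"
    symptoms_worsening = evidence["patient_reported_symptoms"] == "worsening"
    if signs_worsening or symptoms_worsening:
        return "worsening"
    return "stable"

# Sensitive data stays in the local process; only the generic task
# description crosses the privacy boundary.
record = {
    "vital_signs": "deteriorating",
    "labs": "lactate rising",
    "patient_reported_symptoms": "feeling fine",  # discordant self-report
}
scaffold = remote_scaffold("short-horizon organ dysfunction worsening")
print(local_decide(record, scaffold))  # -> worsening
```

The design choice this sketch captures is that the remote model contributes only task-level structure (what evidence categories to examine and in what order), so no protected health information is ever transmitted, while all record-specific reasoning happens locally.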