From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs

arXiv cs.AI / 4/8/2026


Key Points

  • The paper tackles hallucination safety in medical LLMs by focusing on diabetic retinopathy decision settings where evidence can be insufficient or conflicting.
  • It introduces RETINA-SAFE, a retinal-evidence benchmark of 12,522 samples organized into three evidence-relation tasks: E-Align, E-Conflict, and E-Gap.
  • The authors propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box framework that first triages cases as Safe vs Unsafe and then attributes unsafe cases to contradiction-driven vs evidence-gap risk types.
  • ECRT draws on internal representations and logit shifts under CTX/NOCTX conditions and is trained with class balancing; the authors evaluate it across multiple model backbones using evidence-grouped (not patient-disjoint) splits.
  • ECRT improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, and it consistently exceeds a single-stage white-box ablation; the authors read this as support for interpretable, evidence-grounded risk triage as a practical direction.
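
The gains above are reported in balanced accuracy. As a reminder of why that metric matters for Safe/Unsafe triage, here is a minimal sketch of its standard definition (the mean of per-class recall). The summary does not say how the paper computes it, so the function and the toy labels below are illustrative assumptions, not the authors' code or data.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (the standard definition of balanced accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy labels invented for illustration (not data from the paper):
# 8 safe cases and 2 unsafe cases.
y_true = ["safe"] * 8 + ["unsafe"] * 2

# A degenerate detector that never flags anything as unsafe.
always_safe = ["safe"] * 10

plain_acc = sum(t == p for t, p in zip(y_true, always_safe)) / len(y_true)
print(plain_acc)                               # 0.8
print(balanced_accuracy(y_true, always_safe))  # 0.5
```

On this toy set the never-flag detector scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, so on an imbalanced Safe/Unsafe task a gain of +0.15 to +0.19 cannot come from simply predicting the majority class.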

Abstract

Hallucinations in medical large language models (LLMs) remain a safety-critical issue, particularly when available evidence is insufficient or conflicting. We study this problem in diabetic retinopathy (DR) decision settings and introduce RETINA-SAFE, an evidence-grounded benchmark aligned with retinal grading records, comprising 12,522 samples. RETINA-SAFE is organized into three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). We further propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 performs Safe/Unsafe risk triage, and Stage 2 refines unsafe cases into contradiction-driven versus evidence-gap risks. ECRT leverages internal representation and logit shifts under CTX/NOCTX conditions, with class-balanced training for robust learning. Under evidence-grouped (not patient-disjoint) splits across multiple backbones, ECRT provides strong Stage-1 risk triage and explicit subtype attribution, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, and consistently exceeds a single-stage white-box ablation on Stage-1 balanced accuracy. These findings support white-box internal signals grounded in retinal evidence as a practical route to interpretable medical LLM risk triage.
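
The two-stage control flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: CTX and NOCTX are taken to mean running the model with and without the retinal evidence in the prompt, the two features are hypothetical stand-ins for the paper's internal signals, and the Stage-1 and Stage-2 classifiers are placeholder callables rather than the trained ECRT components.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def logit_shift_features(logits_ctx, logits_noctx):
    """Hypothetical features describing how the answer distribution moves
    when evidence is added (CTX) versus withheld (NOCTX)."""
    lp = log_softmax(logits_ctx)
    lq = log_softmax(logits_noctx)
    # KL(p_ctx || p_noctx), computed in log space.
    kl = sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    max_shift = max(abs(a - b) for a, b in zip(logits_ctx, logits_noctx))
    return {"kl_ctx_vs_noctx": kl, "max_abs_shift": max_shift}

def triage(features, stage1_is_unsafe, stage2_is_contradiction):
    """Stage 1: Safe vs Unsafe. Stage 2: subtype attribution for unsafe cases."""
    if not stage1_is_unsafe(features):
        return "Safe"
    if stage2_is_contradiction(features):
        return "Unsafe: contradiction-driven"
    return "Unsafe: evidence-gap"

# Placeholder decision rules with invented thresholds; in the paper these
# would be learned classifiers over white-box signals.
stage1 = lambda f: f["kl_ctx_vs_noctx"] > 0.5
stage2 = lambda f: f["max_abs_shift"] > 3.0

feats = logit_shift_features([2.0, 0.1, -1.0], [2.0, 0.1, -1.0])
print(triage(feats, stage1, stage2))  # Safe
```

Splitting the decision this way means the subtype classifier only ever sees cases already flagged as unsafe, which matches the abstract's description of Stage 2 as a refinement of Stage 1 rather than a parallel three-way classifier.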