L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

arXiv cs.AI / 4/16/2026


Key Points

  • The paper introduces L2D-Clinical, a framework that learns when a specialized BERT-based clinical text classifier should defer to a general-purpose LLM using uncertainty signals and text characteristics.
  • It addresses a limitation of prior “learning to defer” approaches, which assumed a single (human) expert to be universally superior, by showing instead that BERT and LLMs can each dominate on different instances.
  • On ADE detection, where BioBERT (F1=0.911) beats the LLM (F1=0.765), L2D-Clinical improves on BioBERT, reaching F1=0.928 by deferring only 7% of cases to exploit the LLM's high recall.
  • On treatment outcome classification (MIMIC-IV), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887), the method reaches F1=0.980 by deferring 16.8% of cases to the LLM.
  • The study emphasizes cost-aware deployment by selectively leveraging LLM strengths while minimizing API usage rather than routing all inputs to the LLM.
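The deferral gate described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the entropy and length thresholds below are made-up stand-ins for the learned policy, and "text length" is just one plausible proxy for the "text characteristics" signal the authors mention.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a classifier's softmax output;
    higher entropy means a less confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_defer(bert_probs, text_len, entropy_threshold=0.45, len_threshold=300):
    """Defer to the LLM when BERT is uncertain or the note is long.
    Thresholds here are illustrative, not the learned values."""
    return predictive_entropy(bert_probs) > entropy_threshold or text_len > len_threshold

def classify(bert_probs, llm_label, text_len):
    """Route one instance: BERT's argmax unless the gate defers.
    Returns (predicted label, deferred?)."""
    if should_defer(bert_probs, text_len):
        return llm_label, True  # pay the API cost only for this instance
    bert_label = max(range(len(bert_probs)), key=bert_probs.__getitem__)
    return bert_label, False
```

For example, a confident BERT output like `[0.98, 0.02]` on a short note is kept locally, while an ambiguous `[0.55, 0.45]` is routed to the LLM; this selective routing is what keeps API usage low (7% and 16.8% deferral rates in the two reported tasks).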

Abstract

Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral, improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM's high recall compensates for BERT's misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.