L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

arXiv cs.AI / 4/16/2026


Key Points

  • The paper introduces L2D-Clinical, a framework that learns when a specialized BERT-based clinical text classifier should defer to a general-purpose LLM using uncertainty signals and text characteristics.
  • It addresses a limitation of prior “learning to defer” approaches, which assumed a single (human) expert to be universally superior, by showing instead that BERT and LLMs can each dominate on different instances.
  • On ADE detection, where BioBERT (F1=0.911) beats the LLM (F1=0.765), L2D-Clinical improves on BioBERT, reaching F1=0.928 by deferring only 7% of cases to exploit the LLM's high recall.
  • On treatment outcome classification (MIMIC-IV), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887), the method reaches F1=0.980 by deferring 16.8% of cases to the LLM.
  • The study emphasizes cost-aware deployment by selectively leveraging LLM strengths while minimizing API usage rather than routing all inputs to the LLM.
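The deferral gate described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the entropy and length thresholds below are made-up stand-ins for the learned policy, and "text length" is just one plausible proxy for the "text characteristics" signal the authors mention.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a classifier's softmax output;
    higher entropy means a less confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_defer(bert_probs, text_len, entropy_threshold=0.45, len_threshold=300):
    """Defer to the LLM when BERT is uncertain or the note is long.
    Thresholds here are illustrative, not the learned values."""
    return predictive_entropy(bert_probs) > entropy_threshold or text_len > len_threshold

def classify(bert_probs, llm_label, text_len):
    """Route one instance: BERT's argmax unless the gate defers.
    Returns (predicted label, deferred?)."""
    if should_defer(bert_probs, text_len):
        return llm_label, True  # pay the API cost only for this instance
    bert_label = max(range(len(bert_probs)), key=bert_probs.__getitem__)
    return bert_label, False
```

For example, a confident BERT output like `[0.98, 0.02]` on a short note is kept locally, while an ambiguous `[0.55, 0.45]` is routed to the LLM; this selective routing is what keeps API usage low (7% and 16.8% deferral rates in the two reported tasks).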

Abstract

Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral, improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM's high recall compensates for BERT's misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.