Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

arXiv cs.CL / 4/16/2026

Key Points

  • Dental-TriageBench is introduced as the first expert-annotated benchmark for reasoning-driven multimodal dental triage, using 246 de-identified cases from real outpatient workflows.
  • Each case includes golden reasoning trajectories and hierarchical triage labels, enabling evaluation of “complete referral plans” that integrate patient complaints with radiographic evidence from orthopantomograms (OPGs).
  • The study benchmarks 19 multimodal LLMs against three junior dentists and reports a substantial human–model gap, especially for fine-grained treatment-level triage.
  • Analysis indicates effective triage depends on both complaint and OPG information, while model mistakes often occur in cases with multiple referral domains due to overly narrow referral sets and omission-heavy errors.
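
To make "overly narrow referral sets" and "omission-heavy errors" concrete, the following is a minimal sketch of how a predicted referral set could be compared with the expert set. It is not the benchmark's scoring code: the function name, the result fields, and the domain labels are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferralErrors:
    omitted: frozenset  # expert referrals the model missed (omission errors)
    extra: frozenset    # model referrals the expert did not list
    precision: float
    recall: float


def compare_referrals(gold: set, predicted: set) -> ReferralErrors:
    """Compare a predicted referral set against the expert-annotated set."""
    hits = gold & predicted
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 1.0
    return ReferralErrors(
        omitted=frozenset(gold - predicted),
        extra=frozenset(predicted - gold),
        precision=precision,
        recall=recall,
    )


# Hypothetical multi-domain case; the label names are illustrative only.
gold = {"endodontics", "oral_surgery", "periodontics"}
predicted = {"endodontics"}
result = compare_referrals(gold, predicted)
print(sorted(result.omitted), result.precision, round(result.recall, 2))
```

In this made-up case the single predicted referral is correct, so precision is perfect, yet two of the three required domains are missing. A score that only rewards correct predictions would hide that kind of narrow, omission-heavy plan, which is why recall over the full referral set matters for triage.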

Abstract

Dental triage is a safety-critical clinical routing task that requires integrating multimodal clinical information (e.g., patient complaints and radiographic evidence) to determine complete referral plans. We present Dental-TriageBench, the first expert-annotated benchmark for reasoning-driven multimodal dental triage. Built from authentic outpatient workflows, it contains 246 de-identified cases annotated with expert-authored golden reasoning trajectories, together with hierarchical triage labels. We benchmark 19 proprietary, open-source, and medical-domain MLLMs against three junior dentists serving as the human baseline, and find a substantial human–model gap, especially on fine-grained treatment-level triage. Further analyses show that accurate triage requires both complaint and OPG information, and that model errors concentrate on cases with multiple referral domains, where MLLMs tend to produce overly narrow referral sets and omission-heavy errors. Dental-TriageBench provides a realistic testbed for developing multimodal clinical AI systems that are more clinically grounded, coverage-aware, and safer for downstream care.
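
One way to picture why treatment-level triage can be harder than domain-level triage is to roll fine-grained labels up to their parent domain and score both levels. The sketch below assumes a two-level scheme in which every treatment belongs to exactly one domain; the mapping, label names, and exact-match scoring are hypothetical and are not taken from the benchmark.

```python
# Hypothetical two-level label scheme: treatment-level label -> referral domain.
TREATMENT_TO_DOMAIN = {
    "root_canal_treatment": "endodontics",
    "pulpotomy": "endodontics",
    "third_molar_extraction": "oral_surgery",
}


def roll_up(treatments: set) -> set:
    """Map a set of treatment-level labels to the set of domains they fall under."""
    return {TREATMENT_TO_DOMAIN[t] for t in treatments}


def level_match(gold: set, predicted: set) -> dict:
    """Exact-match agreement at the domain level and at the treatment level."""
    return {
        "domain_exact": roll_up(gold) == roll_up(predicted),
        "treatment_exact": gold == predicted,
    }


# Illustrative case: the right departments, but one wrong procedure.
gold = {"root_canal_treatment", "third_molar_extraction"}
predicted = {"pulpotomy", "third_molar_extraction"}
scores = level_match(gold, predicted)
print(scores)
```

Here the prediction routes the patient to the correct domains but names a different endodontic procedure, so it matches at the coarse level and fails at the fine level. A gap of this shape is what level-wise scoring is meant to expose.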