OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

arXiv cs.CL · May 5, 2026


Key Points

  • The paper introduces OralMLLM-Bench, a comprehensive benchmark aimed at assessing how multimodal large language models (MLLMs) perform on cognitive processes needed for dental radiographic analysis.
  • The benchmark covers three dental imaging modalities (periapical, panoramic, and lateral cephalometric radiographs) and evaluates four cognitive categories: perception, comprehension, prediction, and decision-making.
  • It includes 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments used for evaluation.
  • Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are tested to measure the gap between model and clinician performance, identify strengths and limitations, and characterize common failure modes.
  • The authors provide improvement recommendations and position the dataset as a resource for building safer, cognition-aligned AI systems that fit real dental workflows.

Abstract

Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model strengths and limitations, characterize failure patterns, and provide recommendations for improvement. This data resource will facilitate the development of next-generation artificial intelligence systems aligned with clinical cognition, safety requirements, and workflow complexity in dental practice.