DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset

arXiv cs.CV / 5/6/2026


Key Points

  • The paper introduces DALPHIN, the first open multicentric benchmark for independently evaluating digital pathology AI copilots.
  • DALPHIN contains 1,236 images across 300 cases, covering 130 diagnoses, 6 countries, and 14 subspecialties, enabling evaluation across a broad clinical spectrum.
  • The authors include a human benchmark from 31 pathologists in 10 countries with varying expertise, and test both general-purpose models (GPT-5, Gemini 2.5 Pro) and a pathology-specific copilot (PathChat+).
  • Results show PathChat+ performing at a level statistically indistinguishable from expert pathologists in 4 of 6 tasks, Gemini in 2 of 6, and GPT in 1 of 6, highlighting uneven readiness across systems.
  • The benchmark is publicly released with sequestered ground truth and an evaluation platform, with data and methods available via dalphin.grand-challenge.org to support long-term, robust comparisons.

Abstract

Foundation models with visual question answering capabilities for digital pathology are emerging. Such unprecedented technology requires independent benchmarking to assess its potential in assisting pathologists in routine diagnostics. We created DALPHIN, the first multicentric open benchmark for pathology AI copilots, comprising 1,236 images from 300 cases, spanning 130 rare to common diagnoses, 6 countries, and 14 subspecialties. The DALPHIN design and dataset are introduced alongside a human performance benchmark of 31 pathologists from 10 countries with varying expertise. We report results for two general-purpose copilots (GPT-5, Gemini 2.5 Pro) and one pathology-specific copilot (PathChat+) for sequential and independent answer generation. We observed no statistically significant difference from expert-level performance in four of six tasks for PathChat+, two of six for Gemini, and one of six for GPT. DALPHIN is publicly released with sequestered, indirectly accessible ground truth to foster robust and enduring benchmarking. Data, methods, and the evaluation platform are accessible through dalphin.grand-challenge.org.