DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset
arXiv cs.CV / 5/6/2026
Key Points
- The paper introduces DALPHIN, the first open multicenter benchmark designed to evaluate digital pathology AI copilots on an independently curated dataset.
- DALPHIN contains 1,236 images across 300 cases, covering 130 diagnoses, 6 countries, and 14 subspecialties, enabling evaluation across a broad clinical spectrum.
- The authors include a human benchmark from 31 pathologists in 10 countries with varying expertise, and test both general-purpose models (GPT-5, Gemini 2.5 Pro) and a pathology-specific copilot (PathChat+).
- Results show PathChat+ matches expert-level performance on 4 of 6 tasks, Gemini 2.5 Pro on 2 of 6, and GPT-5 on 1 of 6, highlighting uneven readiness across systems.
- The benchmark is publicly released with sequestered ground truth and an evaluation platform, with data and methods available via dalphin.grand-challenge.org to support long-term, robust comparisons.
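The task-level comparison described above can be sketched as a simple per-task check of model accuracy against an expert reference. This is a hypothetical illustration, not the paper's actual evaluation code; the task names and scores below are invented for demonstration.

```python
# Hypothetical sketch of DALPHIN-style task-level reporting:
# a model "matches expert-level performance" on a task when its
# accuracy meets or exceeds the expert pathologists' accuracy.
# All task names and numbers here are illustrative, not from the paper.

def tasks_at_expert_level(model_acc: dict, expert_acc: dict) -> list:
    """Return the tasks on which the model's accuracy meets or
    exceeds the expert reference accuracy for that task."""
    return [task for task, acc in model_acc.items() if acc >= expert_acc[task]]

# Invented example scores for three of the benchmark's tasks
expert = {"diagnosis": 0.82, "grading": 0.78, "ihc_interpretation": 0.75}
model  = {"diagnosis": 0.85, "grading": 0.70, "ihc_interpretation": 0.76}

print(tasks_at_expert_level(model, expert))  # → ['diagnosis', 'ihc_interpretation']
```

Under this criterion, the headline result would read "expert-level on 2 of 3 tasks" for the invented scores above; the actual paper applies its own statistical comparison across six tasks.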