Don't let the bot play doctor! AI gets early diagnoses wrong 80% of the time
'LLMs should not be trusted for patient-facing diagnostic reasoning,' boffins advise
People ask AI for all kinds of advice, including the kind of questions you'd ask a physician. However, the next time you're tempted to ask ChatGPT whether that growth on your face is skin cancer, consider this: research shows today's leading AI models fail at early differential diagnosis in more than 8 out of 10 cases.
Led by Harvard medical student Arya Rao, a research team this week published in JAMA Network Open the results of a study that examined 21 leading off-the-shelf AI models across 29 standardized clinical vignettes. The bots all did fairly well when provided a full portfolio of medical information and asked to make a final diagnosis, with leading models correct 91 percent of the time. Early differential diagnosis, where clinicians try to rule out certain conditions while weighing various possibilities, is where that more-than-80-percent failure rate comes in.
"Every model we tested failed on the vast majority of cases," Rao told The Register in an email. "That's the stage where uncertainty matters most, and it's where these systems are weakest."
In other words, it's the midnight anxiety-fueled WebMD rabbit hole of yesterday all over again, just supercharged with AI that's probably even more likely to get things wrong than you are without it.
"Our results suggest today's off-the-shelf LLMs should not be trusted for patient-facing diagnostic reasoning without structured comprehensive human review, and has significant limitations when used by patients for self-diagnosis," paper coauthor and Massachusetts General Hospital radiologist, Dr. Marc Succi, told us in an email.
"They can project confidence without showing robust reasoning, especially around differential diagnosis," Succi said, adding that such confidence can further inflame the worries of patients with stress and anxiety issues.
Rao pointed out that a failure in the paper didn't necessarily mean that the AI completely bombed the diagnosis, only that it didn't provide a fully correct answer. She said it may be more generous to measure the AIs by their raw accuracy, the proportion of correct answers in each case, which ranged from 63 to 78 percent, far better than the stricter failure metric highlighted in the paper.
The raw data, Rao told us, "suggests that models were often partially correct, getting some but not all of the right answers, even when they failed to produce a fully correct differential under the stricter failure-rate definition."
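To see why the two numbers diverge, here's a minimal sketch of the distinction, not the study's actual scoring code, with a made-up vignette and hypothetical helper functions: a model that names two of three expected diagnoses counts as a total failure under the strict definition but still scores well on raw accuracy.

```python
# Hypothetical illustration of strict failure vs. raw (partial-credit) accuracy.
# The vignette, diagnoses, and function names are invented for this example.

def strict_failure(expected, produced):
    """Case counts as a failure unless every expected diagnosis is listed."""
    return not set(expected).issubset(set(produced))

def raw_accuracy(expected, produced):
    """Fraction of the expected diagnoses the model actually produced."""
    return len(set(expected) & set(produced)) / len(expected)

# Hypothetical early differential: three plausible conditions, model names two.
expected = ["migraine", "tension headache", "cluster headache"]
produced = ["migraine", "tension headache"]

print(strict_failure(expected, produced))  # True  -> counted as a failed case
print(raw_accuracy(expected, produced))    # 0.67  -> "partially correct"
```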
- AI doctor's assistant is easily swayed to change prescriptions, give bad medical advice
- AI chatbots are no better at medical advice than a search engine
- AI models hallucinate, and doctors are OK with that
- ChatGPT is playing doctor for a lot of US residents, and OpenAI smells money
That aside, the team argues that the stricter failure-rate definition still deserves attention, especially given that AI bots are often being flogged as frontline medical care agents designed to narrow down diagnoses before handing patients off to a human for more specialized care.
"Marketing LLMs as diagnostic agents risks fostering false confidence precisely where they are least reliable," the team explained. "Persistent failures in generating differential diagnoses and navigating uncertainty show that LLMs cannot yet be trusted in frontline decision-making."
Succi also said that higher success rates in final diagnosis shouldn't be reassuring, warning that such data can create a misleading sense of safety and model competence.
"Real clinical reasoning starts earlier, when ambiguity is highest, and that is exactly where they remain weakest," Succi said. "Even if you get to the final answer eventually, the wrong differential can result in delays in care, unnecessary procedures with complications, high costs, and much more."
In other words, the next time you're going in circles about a health concern, don't go online unless it's to find the number to your doctor so you can get a proper diagnosis from a human. AI isn't ready yet. ®