Don't let the bot play doctor! AI gets early diagnoses wrong 80% of the time
'LLMs should not be trusted for patient-facing diagnostic reasoning,' boffins advise
People ask AI for all kinds of advice, including the kind of questions you'd ask a physician. However, the next time you're tempted to ask ChatGPT whether that growth on your face is skin cancer, consider this: research shows today's leading AI models fail at early differential diagnosis in more than 8 out of 10 cases.
Led by Harvard medical student Arya Rao, a research team this week published in JAMA Network Open the results of a study that examined 21 leading off-the-shelf AI models across 29 standardized clinical vignettes. The bots all did fairly well when provided a full portfolio of medical information and asked to make a final diagnosis, with leading models correct 91 percent of the time. Early differential diagnosis, where clinicians try to rule out certain conditions while weighing various possibilities, is where that more-than-80-percent failure rate comes in.
"Every model we tested failed on the vast majority of cases," Rao told The Register in an email. "That's the stage where uncertainty matters most, and it's where these systems are weakest."
In other words, it's the midnight anxiety-fueled WebMD rabbit hole of yesterday all over again, just supercharged with AI that's probably even more likely to get things wrong than you are without it.
"Our results suggest today's off-the-shelf LLMs should not be trusted for patient-facing diagnostic reasoning without structured comprehensive human review, and has significant limitations when used by patients for self-diagnosis," paper coauthor and Massachusetts General Hospital radiologist, Dr. Marc Succi, told us in an email.
"They can project confidence without showing robust reasoning, especially around differential diagnosis," Succi said, adding that such confidence can further inflame the worries of patients with stress and anxiety issues.
Rao pointed out that a failure in the paper didn't necessarily mean that the AI completely bombed the diagnosis, only that it didn't provide a fully correct answer. She said it may be more generous to measure the AIs by their raw accuracy, the proportion of correct answers in each case, which ranged from 63 to 78 percent, far better than the stricter failure metric highlighted in the paper.
The raw data, Rao told us, "suggests that models were often partially correct, getting some but not all of the right answers, even when they failed to produce a fully correct differential under the stricter failure-rate definition."
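To see why the two numbers diverge, here's a minimal sketch of the distinction, not the study's actual scoring code, with a made-up vignette and hypothetical helper functions: a model that names two of three expected diagnoses counts as a total failure under the strict definition but still scores well on raw accuracy.

```python
# Hypothetical illustration of strict failure vs. raw (partial-credit) accuracy.
# The vignette, diagnoses, and function names are invented for this example.

def strict_failure(expected, produced):
    """Case counts as a failure unless every expected diagnosis is listed."""
    return not set(expected).issubset(set(produced))

def raw_accuracy(expected, produced):
    """Fraction of the expected diagnoses the model actually produced."""
    return len(set(expected) & set(produced)) / len(expected)

# Hypothetical early differential: three plausible conditions, model names two.
expected = ["migraine", "tension headache", "cluster headache"]
produced = ["migraine", "tension headache"]

print(strict_failure(expected, produced))  # True  -> counted as a failed case
print(raw_accuracy(expected, produced))    # 0.67  -> "partially correct"
```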
- AI doctor's assistant is easily swayed to change prescriptions, give bad medical advice
- AI chatbots are no better at medical advice than a search engine
- AI models hallucinate, and doctors are OK with that
- ChatGPT is playing doctor for a lot of US residents, and OpenAI smells money
That aside, the team argues that the stricter failure-rate definition still deserves attention, especially given that AI bots are often being flogged as frontline medical care agents designed to narrow down diagnoses before handing patients off to a human for more specialized care.
"Marketing LLMs as diagnostic agents risks fostering false confidence precisely where they are least reliable," the team explained. "Persistent failures in generating differential diagnoses and navigating uncertainty show that LLMs cannot yet be trusted in frontline decision-making."
Succi also said that higher success rates in final diagnosis shouldn't be reassuring, warning that such data can create a misleading sense of safety and model competence.
"Real clinical reasoning starts earlier, when ambiguity is highest, and that is exactly where they remain weakest," Succi said. "Even if you get to the final answer eventually, the wrong differential can result in delays in care, unnecessary procedures with complications, high costs, and much more."
In other words, the next time you're going in circles about a health concern, don't go online unless it's to find the number to your doctor so you can get a proper diagnosis from a human. AI isn't ready yet. ®