Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus
arXiv cs.AI · April 28, 2026
Key Points
- The study evaluates whether an agentic LLM system can perform longitudinal clinical reasoning for multiple myeloma decisions using large, heterogeneous patient records, and compares it with single-pass RAG, iterative RAG, and full-context input.
- On 469 question pairs spanning 48 templates and three complexity levels (with labels from oncologists and senior adjudication), the agentic system achieved 79.6% concordance, outperforming all baselines while iterative RAG and full-context approaches plateaued around 75.4–75.8%.
- Gains were largest on harder, criteria-based synthesis questions and on longer patient histories, with the biggest improvement observed for the longest records (top decile).
- Although the overall system error rate (12.2%) was similar to expert disagreement (13.6%), system errors were more clinically significant than expert disagreements, implying the need for prospective evaluation in routine care.
- External validation included MIMIC-IV, but the authors emphasize that prospective studies are required to confirm patient benefit before clinical deployment.
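The headline metrics above (concordance against expert consensus, system error rate vs. expert disagreement rate) can be sketched as a simple evaluation loop. This is a minimal illustration, not the paper's actual pipeline: the `EvalItem` fields and the toy answers are hypothetical, standing in for the study's question pairs and oncologist labels.

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    """One evaluation record: system answer vs. expert-consensus label.

    `experts_disagreed` flags items where the oncologists' initial labels
    differed before senior adjudication (hypothetical field).
    """
    system_answer: str
    expert_consensus: str
    experts_disagreed: bool

def concordance(items):
    # Fraction of items where the system matches the adjudicated consensus.
    return sum(i.system_answer == i.expert_consensus for i in items) / len(items)

def error_rates(items):
    # System error rate alongside the expert inter-rater disagreement rate,
    # the comparison the study draws (12.2% vs. 13.6%).
    n = len(items)
    system_err = sum(i.system_answer != i.expert_consensus for i in items) / n
    expert_dis = sum(i.experts_disagreed for i in items) / n
    return system_err, expert_dis

# Toy data, purely illustrative.
items = [
    EvalItem("start therapy", "start therapy", False),
    EvalItem("watchful waiting", "start therapy", True),
    EvalItem("repeat biopsy", "repeat biopsy", False),
    EvalItem("repeat biopsy", "repeat biopsy", True),
]
print(f"concordance: {concordance(items):.2f}")
print("error rates (system, expert):", error_rates(items))
```

Note that a raw match rate like this treats all disagreements equally; the study's finding that system errors were more clinically significant than expert disagreements is exactly what such a metric cannot capture.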