Rethinking Patient Education as Multi-turn Multi-modal Interaction
arXiv cs.AI / April 17, 2026
Key Points
- The paper argues that patient education in radiology is more complex than typical static medical multimodal tasks because it must ground explanations in evidence, guide what patients should look at, and adapt to confusion or distress.
- It introduces MedImageEdu, a new benchmark designed for multi-turn, evidence-grounded radiology patient education using a DoctorAgent–PatientAgent interaction framework with a hidden patient profile (e.g., education level and health literacy).
- The system can generate visual support by issuing drawing instructions grounded in the radiology report, case images, and the patient’s question, then uses returned images to produce a final multimodal response with accessible explanations.
- Evaluations across multiple vision-language model agents surface three recurring issues: language fluency that outpaces faithful visual grounding, safety as the weakest dimension across disease categories, and particular difficulty with emotionally tense interactions.
- MedImageEdu is positioned as a controlled testbed to assess whether multimodal agents can teach from evidence rather than only produce text-based answers.



