SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy

arXiv cs.CL / 4/1/2026


Key Points

  • The SemioLLM study evaluates eight large language models on an epilepsy diagnostic-reasoning task that maps seizure-description phrases from unstructured clinical narratives to one of seven seizure onset zones using likelihood estimates.
  • Results show that, with prompt engineering and clinician-guided chain-of-thought style reasoning, several models can achieve performance that often matches ground truth and can approach clinician-level accuracy.
  • Model performance is strongly influenced by factors including clinical in-context impersonation, narrative length, and language context, which produced performance variations of 13.7%, 32.7%, and 14.2%, respectively.
  • Expert review of reasoning outputs finds that correct predictions can still rely on hallucinated knowledge and inaccurate source citation, highlighting interpretability and reliability gaps for clinical deployment.
  • The paper proposes SemioLLM as a scalable, domain-adaptable evaluation framework for clinical settings where diagnostic information is embedded in free-text narratives.
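The core task described above, asking a model for likelihood estimates over seven seizure onset zones and checking them against a clinician label, can be illustrated with a minimal sketch. Note that the zone names, the normalization step, and the top-1 scoring metric here are assumptions for illustration; the study's actual labels, prompts, and metrics may differ.

```python
# Hypothetical sketch of SemioLLM-style scoring. The zone labels below are
# illustrative, not quoted from the paper.
SEIZURE_ONSET_ZONES = [
    "frontal", "temporal", "parietal", "occipital",
    "insular", "cingulate", "hypothalamic",
]

def normalize_likelihoods(raw):
    """Turn raw model-reported likelihoods into a probability distribution."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("likelihoods must sum to a positive value")
    return {zone: value / total for zone, value in raw.items()}

def top1_match(likelihoods, ground_truth):
    """Check whether the model's most likely zone equals the clinician label."""
    predicted = max(likelihoods, key=likelihoods.get)
    return predicted == ground_truth

# Example with made-up likelihoods a model might return for one seizure phrase
raw = {zone: 0.0 for zone in SEIZURE_ONSET_ZONES}
raw.update({"temporal": 6.0, "frontal": 3.0, "insular": 1.0})
probs = normalize_likelihoods(raw)
print(top1_match(probs, "temporal"))
```

In practice the likelihoods would come from parsing an LLM's response to a prompt built from the filtered seizure-description phrase; the scoring and normalization logic would stay the same.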

Abstract

Large Language Models (LLMs) have been shown to encode clinical knowledge. Many evaluations, however, rely on structured question-answer benchmarks, overlooking critical challenges of interpreting and reasoning about unstructured clinical narratives in real-world settings. In this study we task eight Large Language Models, including two medical models (GPT-3.5, GPT-4, Mixtral-8x7B, Qwen-72B, LlaMa2, LlaMa3, OpenBioLLM, Med42), with a core diagnostic task in epilepsy: mapping seizure description phrases, after targeted filtering and standardization, to one of seven possible seizure onset zones using likelihood estimates. Most models yield results that often match the ground truth and even approach clinician-level performance after prompt engineering. Specifically, clinician-guided chain-of-thought reasoning led to the most consistent improvements. Performance was further strongly modulated by clinical in-context impersonation, narrative length, and language context (13.7%, 32.7%, and 14.2% performance variation, respectively). However, expert analysis of reasoning outputs revealed that correct predictions can be based on hallucinated knowledge and inaccurate source citation, underscoring the need to improve the interpretability of LLMs in clinical use. Overall, SemioLLM provides a scalable, domain-adaptable framework for evaluating LLMs in clinical disciplines where unstructured verbal descriptions encode diagnostic information. By identifying both the strengths and limitations of LLMs, our work contributes to testing the applicability of foundational AI systems for healthcare.