Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology

arXiv cs.CV / 5/6/2026


Key Points

  • The pilot study tests multimodal large language models (MLLMs) for zero-shot recognition of pathological seizure-related movements from clinical video recordings using 20 ILAE-defined semiological features.
  • MLLMs outperformed fine-tuned CNN and ViT baselines on 13 of 18 features without task-specific training, with stronger performance on salient postural/contextual cues and weaker performance on subtle high-frequency movements.
  • Targeted preprocessing—such as facial cropping, pose estimation, and audio denoising—improved results on 10 of 20 features, suggesting that domain-specific signal enhancement can mitigate model blind spots.
  • Expert review indicated that 94.3% of MLLM explanations for correctly predicted cases received faithfulness scores of at least 60%, meaning the generated rationales are broadly consistent with epileptologist reasoning.
  • The work provides a publicly available codebase and proposes an interpretable, efficient route to adapt general-purpose MLLMs for specialized clinical video analysis.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated robust capabilities in recognizing everyday human activities, yet their potential for analyzing clinically significant involuntary movements in neurological disorders remains largely unexplored. This pilot study evaluates the capability of MLLMs for automated recognition of pathological movements in seizure videos. We assessed the zero-shot performance of state-of-the-art MLLMs on 20 ILAE-defined semiological features across 90 clinical seizure recordings. MLLMs outperformed fine-tuned Convolutional Neural Network (CNN) and Vision Transformer (ViT) baseline models on 13 of 18 features without task-specific training, demonstrating particular strength in recognizing salient postural and contextual features while struggling with subtle, high-frequency movements. Feature-targeted signal enhancement (facial cropping, pose estimation, audio denoising) improved performance on 10 of 20 features. Expert evaluation showed that 94.3 percent of MLLM-generated explanations for correctly predicted cases achieved at least 60 percent faithfulness scores, aligning with epileptologist reasoning. These findings demonstrate the potential of adapting general-purpose MLLMs for specialized clinical video analysis through targeted preprocessing strategies, offering a path toward interpretable, efficient diagnostic assistance. Our code is publicly available at https://github.com/LinaZhangUCLA/PathMotionMLLM.
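The feature-targeted signal enhancement described above (facial cropping, pose estimation, audio denoising) can be illustrated with a minimal sketch of the facial-cropping step. The helper below is hypothetical and not taken from the paper's codebase; it assumes face bounding boxes come from an upstream detector, and simply crops a padded region so subtle orofacial movements fill more of the frame fed to the model.

```python
import numpy as np

def crop_face_region(frame: np.ndarray, box: tuple, pad: float = 0.25) -> np.ndarray:
    """Crop a padded facial region from a video frame of shape (H, W, C).

    `box` is (x, y, w, h) from an upstream face detector; the padding keeps
    some context around the face while discarding the rest of the scene.
    """
    h_img, w_img = frame.shape[:2]
    x, y, w, h = box
    px, py = int(w * pad), int(h * pad)
    # Clamp the padded box to the frame boundaries.
    x0, y0 = max(0, x - px), max(0, y - py)
    x1, y1 = min(w_img, x + w + px), min(h_img, y + h + py)
    return frame[y0:y1, x0:x1]

# Example: crop a 40x40 face box (with 25% padding) out of a 480x640 frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
face = crop_face_region(frame, (100, 100, 40, 40))
print(face.shape)  # (60, 60, 3)
```

Applied per frame before inference, this kind of crop is one plausible way to realize the "facial cropping" enhancement; the analogous pose-estimation and audio-denoising steps would slot into the same per-feature preprocessing pipeline.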