MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

arXiv cs.AI / 2026/3/24


Key Points

  • The paper introduces MARCUS, an agentic multimodal vision-language model designed to interpret cardiac data end-to-end, handling ECGs, echocardiograms, and CMR both individually and together as multimodal inputs.
  • MARCUS uses a hierarchical agentic architecture with modality-specific expert vision-language models coordinated by a multimodal orchestrator, combining domain-trained visual encoders with multi-stage language-model optimization.
  • Trained on 13.5M images (including ECGs, echocardiograms, and CMR) and a curated dataset of 1.6M questions, MARCUS reports state-of-the-art results and improvements over frontier models on internal (Stanford) and external (UCSF) cohorts.
  • Reported accuracies are 87–91% for ECG, 67–86% for echocardiography, and 85–88% for CMR, with multimodal performance reaching 70% accuracy, substantially higher than the compared frontier systems.
  • The authors claim robustness against “mirage reasoning” (unintended textual or hallucinated visual rationales) and state they are releasing models, code, and benchmarks as open source.
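The hierarchical design described above, with modality-specific experts coordinated by a multimodal orchestrator, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: all class, function, and finding names here are hypothetical assumptions, and the real system would use trained vision-language models and an LLM-based synthesis step rather than stubs.

```python
# Hypothetical sketch of an agentic orchestrator that routes each cardiac
# input to a modality-specific expert and fuses their findings.
# All names and outputs are illustrative, not MARCUS's actual API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class CardiacInput:
    modality: str   # "ecg", "echo", or "cmr"
    data: bytes     # raw signal/image payload


def ecg_expert(data: bytes) -> str:
    return "ECG finding: sinus rhythm"   # stand-in for an ECG expert VLM


def echo_expert(data: bytes) -> str:
    return "Echo finding: normal LVEF"   # stand-in for an echo expert VLM


def cmr_expert(data: bytes) -> str:
    return "CMR finding: no LGE"         # stand-in for a CMR expert VLM


class Orchestrator:
    """Dispatches each input to its modality expert, then fuses findings."""

    def __init__(self) -> None:
        self.experts: dict[str, Callable[[bytes], str]] = {
            "ecg": ecg_expert,
            "echo": echo_expert,
            "cmr": cmr_expert,
        }

    def interpret(self, inputs: list[CardiacInput]) -> str:
        findings = [self.experts[item.modality](item.data) for item in inputs]
        # A real orchestrator would have a language model synthesize a
        # report from the expert findings; here we simply concatenate.
        return "; ".join(findings)


report = Orchestrator().interpret([
    CardiacInput("ecg", b""),
    CardiacInput("echo", b""),
])
print(report)
```

The key design point the paper emphasizes is the separation of concerns: each expert sees only its own modality (which the authors argue helps resist "mirage reasoning"), while the orchestrator alone handles cross-modality integration.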

Abstract

Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.