Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

arXiv cs.CV / 5/1/2026


Key Points

  • The paper introduces Echo-α, an agentic multimodal reasoning model designed to improve ultrasound interpretation by combining accurate lesion localization with holistic clinical reasoning.
  • Echo-α uses an invoke-and-reason framework that coordinates organ-specific detector outputs, integrates them with global visual context, and produces grounded diagnostic decisions rather than relying on detector-only inference.
  • Training proceeds via a nine-task supervised curriculum followed by sequential reinforcement learning with different reward trade-offs to obtain variants focused on lesion grounding and final diagnosis.
  • On multi-center renal and breast ultrasound benchmarks (including cross-center tests), Echo-α outperforms baseline methods on both grounding and diagnosis, with the reported metrics indicating stronger generalization across centers.
  • The authors argue that agentic multimodal reasoning can convert specialized detectors into verifiable clinical evidence, and they provide a public repository for the work.
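The invoke-and-reason pattern described above can be sketched in a few lines: an agent invokes a specialized detector as a tool, packages its boxes as textual evidence, and asks a multimodal model to reason over that evidence together with the full image. This is a minimal illustration, not the paper's implementation; all names (`Detection`, `invoke_and_reason`, the `mllm.generate` interface) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Detection:
    """Hypothetical detector output: one localized lesion candidate."""
    box: tuple    # (x1, y1, x2, y2) in pixel coordinates
    score: float  # detector confidence
    label: str    # e.g. "renal_cyst"

def invoke_and_reason(image, organ: str,
                      detect: Callable[[object, str], List[Detection]],
                      mllm) -> dict:
    """Sketch of the invoke-and-reason loop: the MLLM receives both the raw
    image (global context) and detector outputs (local evidence), then emits
    a diagnosis grounded in the cited boxes rather than detector-only output."""
    detections = detect(image, organ)                       # 1. invoke the organ-specific tool
    evidence = [f"{d.label} at {d.box} (p={d.score:.2f})" for d in detections]
    prompt = (
        f"Ultrasound of {organ}. Detector evidence: {evidence}. "
        "Integrate this with the full image and give a grounded diagnosis."
    )
    answer = mllm.generate(image, prompt)                   # 2. reason over evidence
    return {"evidence": detections, "diagnosis": answer}
```

The key design point is that the detector's boxes become verifiable evidence inside the prompt, so the model's final answer can be checked against concrete localizations.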

Abstract

Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-α, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-α is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward trade-offs, yielding Echo-α-Grounding for lesion anchoring and Echo-α-Diagnosis for final diagnosis. On multi-center renal and breast ultrasound benchmarks, Echo-α outperforms competitive baselines on both grounding and diagnosis. In particular, on cross-center test sets, Echo-α-Grounding attains 56.73%/43.78% F1@0.5 and Echo-α-Diagnosis reaches 74.90%/49.20% overall accuracy on renal/breast ultrasound. These results suggest that agentic multimodal reasoning can turn specialized detectors into verifiable clinical evidence, offering a practical route toward ultrasound AI systems that are more accurate, interpretable, and transferable. The repository is at https://github.com/MiliLab/Echo-Alpha.
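The grounding results above are reported as F1@0.5, i.e. F1 where a predicted box counts as a true positive if it overlaps a ground-truth box with IoU ≥ 0.5. A minimal sketch of that metric follows; the greedy one-to-one matching shown here is a common convention, though the paper's exact matching rule may differ.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def f1_at_05(preds, gts):
    """F1@0.5: each prediction is a TP if it greedily matches an unused
    ground-truth box with IoU >= 0.5; unmatched preds are FPs, unmatched
    ground truths are FNs."""
    used, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and iou(p, g) >= 0.5:
                used.add(i)
                tp += 1
                break
    fp, fn = len(preds) - tp, len(gts) - tp
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```

For example, one perfect box against one ground truth gives F1 = 1.0, while an extra spurious box drops precision to 0.5 and F1 to about 0.67.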