Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis

arXiv cs.AI / March 23, 2026

Key Points

  • Gastric-X introduces a large-scale multimodal benchmark dataset for gastric cancer analysis comprising 1.7K cases, each pairing resting and dynamic CT scans with endoscopic images, structured biochemical indicators, expert diagnostic notes, and tumor bounding boxes to reflect realistic clinical workflows (a sketch of the per-case record follows this list).
  • The benchmark evaluates five core tasks—Visual Question Answering, report generation, cross-modal retrieval, disease classification, and lesion localization—to simulate critical stages of clinical decision-making.
  • The study probes whether current vision-language models can meaningfully correlate biochemical signals with spatial tumor features and textual reports, aiming to align AI reasoning with physicians’ cognitive and evidential reasoning processes.
  • Gastric-X is positioned as a resource to drive the development of next-generation medical VLMs and bridge research with real-world clinical practice.
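To make the per-case structure concrete, here is a minimal sketch of how a single Gastric-X record could be represented in Python. The field names and types are illustrative assumptions; the paper does not publish a release schema, so the actual data format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class GastricXCase:
    """Hypothetical per-case record mirroring the modalities described in the paper."""
    case_id: str
    resting_ct_path: str                  # resting (non-dynamic) CT volume
    dynamic_ct_paths: list[str]           # dynamic multi-phase CT volumes
    endoscopy_paths: list[str]            # endoscopic images
    biochemical: dict[str, float]         # structured lab indicators, e.g. {"CEA": 3.2}
    diagnostic_note: str                  # expert-authored free-text note
    # Tumor bounding boxes as (x_min, y_min, x_max, y_max) in image coordinates.
    tumor_boxes: list[tuple[int, int, int, int]] = field(default_factory=list)
```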

Abstract

Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural-image domains. However, their application to medical diagnosis remains limited by the lack of comprehensive, structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis comprising 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic images, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding-box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capabilities of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of the clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
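Of the five tasks, cross-modal retrieval has the most mechanical scoring: images and reports are embedded into a shared space and ranked by similarity. Below is a minimal sketch of Recall@k under that standard setup, assuming CLIP-style encoders have already produced matched embedding matrices (row i of each matrix belongs to case i); this is an illustration of the common protocol, not the paper's own evaluation code.

```python
import numpy as np

def retrieval_recall_at_k(image_embs: np.ndarray, text_embs: np.ndarray, k: int = 5) -> float:
    """Recall@k for image-to-text retrieval over matched pairs:
    a hit means the report for case i ranks in the top k for image i."""
    # L2-normalize so the dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = image_embs @ text_embs.T                 # (n_images, n_texts) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]         # top-k text indices per image
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())
```

Normalizing both matrices makes the dot product a cosine similarity, so the ranking is invariant to the scale of each encoder's outputs.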