Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?

arXiv cs.CV / 4/27/2026

Key Points

  • The study tests whether autoregressive video modeling, scaled in the manner of large generative models, can generalize zero-shot to medical imaging tasks without any training on medical data.
  • A large vision model (LVM) is evaluated on four representative medical imaging problems—organ segmentation, denoising, super-resolution, and motion prediction—showing competitive results even with no domain-specific fine-tuning.
  • In CT-based radiotherapy motion prediction, the model forecasts future 3D CT phases directly from earlier phases of 4D CT, producing anatomically consistent outputs that reflect patient-specific respiratory dynamics with realistic temporal coherence (a minimal rollout sketch follows this list).
  • The experiments use 4D CT data from 122 patients (over 1,820 3D CT volumes), and the motion-prediction results surpass specialized deformation-vector-field (DVF)-based and generative baselines in spatial accuracy, reaching state-of-the-art performance.
  • Overall, the findings suggest emerging zero-shot “learner/reasoner” behavior for medical video modeling and point to video-model-based medical foundation models as a unifying direction.
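To make the motion-prediction setup concrete, here is a minimal sketch (not the paper's code) of a video-style autoregressive rollout over 4D CT respiratory phases. The `predict_next_phase` stub is a hypothetical stand-in for the pretrained LVM, and the phase count and volume shape are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for the pretrained large vision model (LVM).
# It simply repeats the last observed phase (a persistence forecast) so
# that the rollout loop below runs end to end.
def predict_next_phase(context_phases: list[np.ndarray]) -> np.ndarray:
    """Predict the next 3D CT phase from the phases observed so far."""
    return context_phases[-1].copy()

def autoregressive_rollout(observed_phases: list[np.ndarray],
                           n_future: int) -> list[np.ndarray]:
    """Video-style forecasting of a 4D CT respiratory cycle: each predicted
    phase is appended to the context and conditions the next prediction."""
    context = list(observed_phases)
    predictions = []
    for _ in range(n_future):
        next_phase = predict_next_phase(context)
        predictions.append(next_phase)
        context.append(next_phase)
    return predictions

# Illustrative example: 6 observed phases of a 10-phase 4D CT scan,
# forecasting the remaining 4 phases (shapes are arbitrary here).
observed = [np.random.rand(64, 64, 64).astype(np.float32) for _ in range(6)]
future = autoregressive_rollout(observed, n_future=4)
print(len(future), future[0].shape)  # -> 4 (64, 64, 64)
```

The point of the rollout is that forecasting is conditioned only on earlier phases of the same scan, with no patient-specific training or deformation-field fitting, which is what the zero-shot comparison against DVF-based baselines rests on.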

Abstract

Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners, laying the groundwork for future medical foundation models built on video models.
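For the non-temporal tasks (segmentation, denoising, super-resolution), one plausible way an LVM-style sequential model can be used zero-shot is visual in-context prompting: interleave example input/output frames and let the model generate the next frame as the task output. The abstract does not spell out the paper's prompting protocol, so the sketch below is purely an assumption for illustration; `generate_next_frame` is a hypothetical placeholder for the pretrained model.

```python
import numpy as np

# Hypothetical placeholder for the pretrained LVM's next-frame generator.
def generate_next_frame(frames: list[np.ndarray]) -> np.ndarray:
    """Return the model's predicted continuation of a frame sequence."""
    return np.zeros_like(frames[-1])  # placeholder output

def zero_shot_segment(examples: list[tuple[np.ndarray, np.ndarray]],
                      query_slice: np.ndarray) -> np.ndarray:
    """Cast organ segmentation as next-frame prediction: interleave
    (CT slice, mask) example pairs, append the query slice, and read the
    generated frame back as the predicted mask."""
    prompt: list[np.ndarray] = []
    for ct_slice, mask in examples:
        prompt.extend([ct_slice, mask])
    prompt.append(query_slice)
    return generate_next_frame(prompt)

# Illustrative shapes only: 256x256 CT slices with binary masks.
examples = [
    (np.random.rand(256, 256),
     np.random.randint(0, 2, (256, 256)).astype(np.float32))
    for _ in range(2)
]
query = np.random.rand(256, 256)
predicted_mask = zero_shot_segment(examples, query)
print(predicted_mask.shape)  # -> (256, 256)
```

Under the same assumed framing, denoising and super-resolution would swap in (noisy, clean) or (low-resolution, high-resolution) example pairs, with the model's generated frame read back as the restored image.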