Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

arXiv cs.AI / 3/30/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that current medical vision-language pretraining for chest X-rays is limited because it treats radiographs as context-agnostic and largely ignores radiologists’ gaze patterns used during diagnosis.
  • It introduces CoGaze, a context- and gaze-guided pretraining framework that adds a context-infused vision encoder, multi-level semantic alignment objectives, and disease-aware cross-modal priors.
  • CoGaze uses radiologists’ gaze as probabilistic priors to guide model attention toward diagnostically salient regions, aiming to better reflect real diagnostic workflows.
  • Reported experiments show consistent improvements over state-of-the-art methods across diverse tasks, including gains in free-text and structured report generation (CheXbert F1, BLEU-2), zero-shot classification (AUROC), and image-text retrieval (Precision@1).
  • The authors publicly release the code for reproducibility and further experimentation with the CoGaze approach.
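To make the gaze-as-prior idea concrete: one common way to steer model attention with eye-tracking data is to penalize the divergence between the model's spatial attention map and a normalized gaze heatmap. The sketch below is a minimal, hypothetical illustration of that general technique; the paper's actual objective, function names, and shapes are not specified in this summary and are assumptions here.

```python
import numpy as np

def gaze_prior_loss(attn, gaze, eps=1e-8):
    """KL divergence from a radiologist gaze heatmap (treated as a
    probabilistic prior) to the model's spatial attention map.

    attn, gaze: non-negative 1-D arrays over image patches; each is
    normalized to a probability distribution before comparison.
    NOTE: hypothetical sketch -- not CoGaze's actual loss.
    """
    p = gaze / (gaze.sum() + eps)   # gaze prior distribution
    q = attn / (attn.sum() + eps)   # model attention distribution
    # KL(p || q): zero when attention matches the gaze prior exactly
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

Minimizing this term alongside the usual pretraining objectives pulls attention mass toward regions radiologists actually fixated on, while leaving the model free elsewhere; when attention already matches the gaze distribution, the penalty vanishes.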

Abstract

Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.
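The "hybrid-positive contrastive learning" the abstract mentions presumably generalizes standard image-text InfoNCE to allow more than one positive text per image (e.g. multiple report sentences describing the same finding). A minimal sketch of a multi-positive contrastive loss is below; the function name, shapes, and temperature are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def multi_positive_contrastive_loss(img, txt, pos_mask, tau=0.07):
    """InfoNCE-style image-to-text alignment with possibly several
    positives per image anchor.

    img: (N, d) image embeddings; txt: (M, d) text embeddings;
    pos_mask: (N, M) boolean, True where pair (i, j) is a positive.
    Hypothetical sketch -- not CoGaze's actual objective.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau  # (N, M) scaled cosine similarities
    # row-wise log-softmax over all candidate texts
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # average log-likelihood over each anchor's set of positives
    per_anchor = (log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return float(-per_anchor.mean())
```

With a single positive per row this reduces to the usual CLIP-style loss; the mask is what lets "hybrid" positive sets (paraphrases, same-disease reports) share credit instead of being treated as negatives.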