REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction

arXiv cs.AI / 4/22/2026


Key Points

  • REVEAL (REtinal-risk Vision-Language Early Alzheimer's Learning) is an arXiv framework that aligns retinal color fundus images with individualized, disease-specific risk profiles to predict incident Alzheimer’s disease (AD) and dementia.
  • The approach addresses a key gap in prior work by jointly modeling multimodal inputs—retinal morphometric features and structured questionnaire-based risk factors—instead of treating imaging and risk factors separately.
  • It converts real-world questionnaire risk factors into clinically interpretable narrative text that can be used with pretrained vision-language models (VLMs), enabling cross-modal learning.
  • A group-aware contrastive learning (GACL) strategy is proposed to cluster patients with similar retinal morphometry and risk factors, improving alignment of shared patterns across modalities.
  • The method reports substantially better performance than both state-of-the-art retinal imaging models paired with clinical text encoders and general-purpose VLMs, with predictions made on average ~8 years before diagnosis.

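The questionnaire-to-narrative step in the second and third bullets can be sketched as a simple template renderer: structured fields become short clinical sentences that a pretrained VLM text encoder can consume. The field names below (`age`, `hypertension`, `smoking_status`, etc.) are illustrative assumptions; the summary does not specify the actual questionnaire schema REVEAL uses.

```python
def risk_profile_to_narrative(record: dict) -> str:
    """Render a structured questionnaire record as a short clinical narrative.

    Field names here are hypothetical placeholders, not the paper's schema.
    """
    parts = [f"The patient is a {record['age']}-year-old {record['sex']}."]

    # Collect binary condition flags into one sentence.
    conditions = [name for name in ("hypertension", "diabetes", "hyperlipidemia")
                  if record.get(name)]
    if conditions:
        parts.append("Reported conditions include " + ", ".join(conditions) + ".")
    else:
        parts.append("No major cardiometabolic conditions were reported.")

    # Lifestyle factors become their own sentences.
    smoking = record.get("smoking_status")
    if smoking:
        parts.append(f"The patient is a {smoking} smoker.")

    return " ".join(parts)
```

Keeping the output as plain clinical prose (rather than key-value strings) is what makes the risk profile compatible with text encoders pretrained on natural language.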
Abstract

The retina provides a unique, noninvasive window into Alzheimer's disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer's Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.
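The group-aware contrastive learning (GACL) idea described above, treating patients with similar retinal morphometry and risk factors as positive pairs rather than only the image-text pair from the same patient, can be sketched as a supervised-contrastive variant of the usual InfoNCE objective. This is a minimal NumPy sketch under the assumption that group membership is given as a precomputed cluster label per patient; the paper's actual clustering criterion and loss details are not specified in this summary.

```python
import numpy as np

def group_aware_contrastive_loss(img_emb, txt_emb, groups, temperature=0.07):
    """InfoNCE-style loss where all same-group image-text pairs are positives.

    img_emb, txt_emb: (N, D) arrays of image and narrative-text embeddings.
    groups: length-N cluster labels (hypothetical precomputed patient groups).
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature               # (N, N) cross-modal sims
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability

    groups = np.asarray(groups)
    pos_mask = groups[:, None] == groups[None, :]    # same-group => positive

    # Row-wise log-softmax over all candidate text embeddings.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative mean log-likelihood of the positives for each image anchor.
    loss_per_anchor = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return loss_per_anchor.mean()
```

Relative to standard CLIP-style alignment (where only the diagonal pairs are positive), widening the positive set to a patient's group pulls together embeddings of clinically similar patients across modalities, which is the stated purpose of GACL.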