Quantifying the human visual exposome with vision language models

arXiv cs.CV / 5/6/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study tackles the lack of direct, objective quantification of the visual environment’s role in mental health by moving beyond coarse location proxies and self-reports.
  • It combines ecological momentary assessment with vision-language models (VLMs) to estimate the semantic “richness” of daily visual experience from participant photos.
  • Across 2,674 participant-generated photographs, VLM-derived greenness estimates robustly predicted both momentary affect and chronic stress, consistent with established benchmarks (a minimal rating sketch follows this list).
  • The authors build a semi-autonomous LLM-driven pipeline that mines over seven million scientific publications to extract nearly 1,000 environment-related features linked to mental health.
  • On real-world imagery, up to 33% of the VLM-derived context ratings correlated significantly with affect and stress, supporting scalable visual exposomics (a correlation-screen sketch follows the Abstract).
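
The paper summarized here doesn't include its prompting code, but the core measurement step, asking a VLM to score a single photo on a single environmental feature such as greenness, is straightforward to picture. Below is a minimal Python sketch assuming an OpenAI-compatible chat endpoint; the model name, prompt wording, and 0-100 scale are illustrative assumptions, not the authors' protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_greenness(image_path: str) -> float:
    """Ask a VLM for a 0-100 greenness rating of one photo.

    Model choice, prompt, and scale are illustrative assumptions,
    not the paper's actual protocol.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical VLM choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Rate the amount of visible green vegetation in "
                          "this photo on a 0-100 scale. Reply with a single "
                          "number.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # A production version would validate the reply instead of trusting it.
    return float(resp.choices[0].message.content.strip())
```

Running this once per photo yields per-image feature scores that can then be related to the ecological momentary assessment reports of affect and stress.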

Abstract

The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision-language models (VLMs) to quantify the semantic richness of human visual experience. Across 2,674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous large language model (LLM)-based pipeline that mined over seven million scientific publications to extract nearly 1,000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33% of VLM-extracted context ratings significantly correlated with affect and stress. These findings establish a scalable, objective paradigm for visual exposomics, enabling high-throughput decoding of how the visible world is associated with mental health.
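
The literature-mining pipeline is described only at a high level, but its unit operation, an LLM reading one abstract and emitting candidate environment features, can be sketched. Everything below (model, prompt, JSON output contract) is an assumption for illustration, not the authors' implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY; model and prompt are illustrative

PROMPT = (
    "From the abstract below, list physical-environment features that the "
    "study empirically links to mental health. Respond ONLY with a JSON "
    'array of short feature names, e.g. ["street greenery", "traffic noise"]. '
    "Respond with [] if the abstract is not relevant."
)

def extract_features(abstract: str) -> list[str]:
    """One mining step of a hypothetical literature pipeline: abstract -> features."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{abstract}"}],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return []  # a real pipeline would log and retry unparseable replies
```

Deduplicating and normalizing the outputs across millions of abstracts would then yield a feature lexicon on the order of the paper's nearly 1,000 entries.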
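
The headline figure, that up to 33% of context ratings correlated significantly with affect and stress, implies a mass-univariate screen across features. A naive version of such a screen is sketched below, assuming one rating per photo per feature and Benjamini-Hochberg FDR control; the authors' actual statistical model (for instance, how repeated measures per participant are handled) is not specified in this summary.

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

def screen_features(ratings: np.ndarray, outcome: np.ndarray, alpha: float = 0.05):
    """Correlate each column of ratings (n_photos x n_features) with an outcome
    such as momentary affect, controlling the false discovery rate across features.

    Returns the fraction of features surviving correction and the corrected p-values.
    """
    pvals = np.array([pearsonr(ratings[:, j], outcome)[1]
                      for j in range(ratings.shape[1])])
    reject, pvals_fdr, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return reject.mean(), pvals_fdr
```

The fraction returned by this screen is the analogue of the paper's "up to 33%" figure.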