Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

arXiv cs.CV / 3/16/2026

Key Points

The paper introduces domain conformal bounds (DCB), a theoretical framework for assessing whether domains diverge in unknown causal factors, so that such differences can be evaluated objectively without metadata or protocol-level information from data collectors.
It proposes GenEval, a multimodal vision-language model approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and improve single-source domain generalization.
GenEval is evaluated on eight diabetic retinopathy (DR) datasets and two resting-state fMRI seizure onset zone (SOZ) datasets, achieving average accuracies of 69.2% for DR and 81% for SOZ and outperforming the strongest baselines by 9.4% and 1.8%, respectively.
The work presents a general framework for assessing domain shifts and improving single-source domain generalization (SDG) in medical imaging with multimodal learning, potentially applicable beyond the tested modalities.

Abstract

Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal vision-language model (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracies of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.
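
For readers unfamiliar with LoRA, the following is a minimal sketch of the general technique, not the paper's GenEval implementation. It assumes the standard formulation in which a frozen weight matrix W is augmented with a trainable low-rank product B·A scaled by alpha/rank. The class name `LoRALinear`, the helper `matvec`, and all sizes are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Illustrative LoRA-style linear layer: y = W x + (alpha / rank) * B (A x).

    W stays frozen; only A and B would be trained. This is a generic sketch
    of Low-Rank Adaptation, not code from the paper.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weights, shape d_out x d_in
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)                # frozen path
        update = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Number of adapter parameters: rank * (d_in + d_out)."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # hypothetical 2 x 3 frozen weight
    layer = LoRALinear(W, rank=1, alpha=2.0)
    print(layer.forward([1.0, 0.0, -1.0]))   # equals W x while B is all zeros
    print(layer.trainable_params())          # 1 * (3 + 2) adapter parameters
```

The point of the design is that the adapter adds only rank × (d_in + d_out) trainable parameters per layer, which is what makes it practical to adapt a multi-billion-parameter model on a single source domain.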