Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

arXiv cs.CV / 5/5/2026


Key Points

  • The paper argues that vision-encoder selection for Vision-Language Models (VLMs) lacks a principled framework, despite many existing experiments on combining vision encoders and LLMs.
  • Experiments across 19 pretrained vision encoders show that common heuristics (largest encoder size or highest zero-shot accuracy) have weak to only moderate correlation with the resulting VLM performance.
  • The authors propose that structural similarity across modalities is a key, previously underappreciated factor, and they quantify it using Gromov-Wasserstein distance.
  • Theoretical analysis links the learnability of cross-modality mapping to the Gromov-Wasserstein distance, and empirical results from 60+ full VLM training runs confirm that the inference-only metric predicts final performance more accurately than other selection methods.
  • The approach enables efficient prediction of VLM outcomes before running full training, reducing the cost of model selection.

Abstract

Vision-Language Models (VLMs) extend traditional LLMs with visual capabilities by integrating vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question through comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing the encoder with the largest size or the highest zero-shot accuracy, consistently fail to identify optimal models; these metrics show only weak to moderate correlation with VLM performance. This intriguing finding raises a fundamental question: which properties of a vision encoder matter for VLMs? Through comprehensive analysis, we identify structural similarity across modalities as a crucial but previously overlooked factor in vision-encoder selection, and we measure it using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of the cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric significantly outperforms alternative model-selection strategies and exhibits a much stronger correlation with final VLM performance, enabling efficient and effective prediction of VLM performance before full training.
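To make the structural-similarity idea concrete, here is a minimal NumPy sketch (not the paper's implementation; function names are illustrative). The Gromov-Wasserstein distance compares the intra-modality distance matrices of the vision and text embedding spaces, minimizing over all couplings. Evaluating the GW objective at the fixed identity coupling for paired image-caption embeddings yields a simple upper bound on the true GW distance; a proper solver (e.g., one from an optimal-transport library) would optimize over couplings instead.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix within one modality's embedding set."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negative round-off

def gw_upper_bound(Xv, Xt):
    """GW objective at the identity coupling (image i paired with
    caption i), which upper-bounds the true GW distance, since GW
    minimizes this objective over all couplings.

    With T[i, j] = 1[i == j] / n, the quadratic GW objective
        sum_{i,j,k,l} (Cv[i, k] - Ct[j, l])**2 * T[i, j] * T[k, l]
    collapses to a direct comparison of the two distance matrices.
    """
    n = Xv.shape[0]
    Cv, Ct = pairwise_dist(Xv), pairwise_dist(Xt)
    return float(np.sum((Cv - Ct) ** 2)) / n ** 2
```

Because GW compares distance matrices rather than raw coordinates, it is invariant to isometries of either embedding space: rotating one modality's embeddings leaves the bound at zero, which is what makes it a sensible cross-modality structural measure even when the two spaces have unrelated bases.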
