Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
arXiv cs.CV / 5/5/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that vision-encoder selection for Vision-Language Models (VLMs) lacks a principled framework, despite extensive empirical work on pairing vision encoders with LLMs.
- Experiments across 19 pretrained vision encoders show that common heuristics, such as picking the largest encoder or the one with the highest zero-shot accuracy, correlate only weakly to moderately with the resulting VLM's performance.
- The authors propose that structural similarity between the two modalities' embedding spaces is a key, previously underappreciated factor, and they quantify it with the Gromov-Wasserstein distance (a minimal sketch follows this list).
- Theoretical analysis links the learnability of the cross-modal mapping to the Gromov-Wasserstein distance, and empirical results from more than 60 full VLM training runs confirm that this inference-only metric predicts final performance more accurately than other selection methods.
- The approach enables efficient prediction of VLM outcomes before running full training, reducing the cost of model selection.
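
For intuition, here is a minimal sketch of how an inference-only Gromov-Wasserstein score between a vision encoder's embedding space and an LLM's text-embedding space might be computed. It assumes paired image-caption embeddings and uses the POT optimal-transport library; the paper's exact sampling, normalization, and solver settings are not given in this summary, and the helper name `gw_score` is illustrative.

```python
# Sketch: score a candidate vision encoder by the Gromov-Wasserstein (GW)
# distance between its embedding geometry and the LLM's embedding geometry.
# Requires the POT library (pip install pot). This is an illustrative recipe,
# not the paper's exact procedure.
import numpy as np
import ot  # Python Optimal Transport


def gw_score(vision_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Lower score = structurally closer embedding spaces.

    vision_emb: (n, d_v) vision-encoder embeddings of n sampled images
    text_emb:   (n, d_t) LLM embeddings of the same n samples' captions
    """
    # GW compares intra-modal distance matrices, never the raw vectors,
    # so the two embedding dimensions d_v and d_t may differ.
    C1 = ot.dist(vision_emb, vision_emb)  # (n, n) pairwise distances
    C2 = ot.dist(text_emb, text_emb)      # (n, n) pairwise distances
    # Normalize each matrix so the score reflects shape, not scale.
    C1 /= C1.max()
    C2 /= C2.max()
    n = vision_emb.shape[0]
    p = np.full(n, 1.0 / n)  # uniform marginal over the sampled pairs
    q = np.full(n, 1.0 / n)
    # gromov_wasserstein2 returns the GW discrepancy as a scalar.
    return ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun="square_loss")
```

Ranking candidate encoders by such a score needs only forward passes through each encoder and the LLM, which is what would make the selection step cheap compared with training a full VLM per candidate.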