Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

arXiv cs.LG · March 31, 2026


Key Points

  • Proposes a post-hoc framework for vision-language model embedding spaces that extracts and names semantic hierarchies from class centroids using clustering and concept-bank matching.
  • Introduces quantitative methods to verify whether the induced hierarchy aligns with human ontologies, using tree/edge consistency measures and uncertainty-aware hierarchical inference for utility evaluation.
  • Presents an ontology-guided post-hoc alignment approach that learns a lightweight transformation of the embedding space, leveraging UMAP to shape target neighborhoods toward a desired hierarchy.
  • Finds systematic modality effects across 13 pretrained VLMs and 4 datasets: image encoders tend to be more discriminative, while text encoders produce hierarchies that better match human taxonomies.
  • Highlights an observed trade-off between zero-shot accuracy and ontological plausibility, offering routes for improving semantic alignment in shared image-text embeddings.
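The first key point above can be made concrete with a small sketch. The following is an illustrative toy example, not the paper's implementation: it builds a binary merge tree over class centroids with SciPy's agglomerative (Ward) clustering, then names each internal node by cosine similarity against a hypothetical concept bank (in practice the bank embeddings would come from the VLM's text encoder; here both centroids and bank vectors are random placeholders).

```python
# Illustrative sketch: binary hierarchy over class centroids via
# agglomerative clustering, with dictionary-based naming of internal nodes.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

# Toy class centroids (e.g., mean image embedding per class), L2-normalized.
classes = ["cat", "dog", "car", "truck"]
centroids = rng.normal(size=(4, 16))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Hypothetical concept bank: candidate names with (placeholder) embeddings.
bank_names = ["animal", "vehicle", "object"]
bank_vecs = rng.normal(size=(3, 16))
bank_vecs /= np.linalg.norm(bank_vecs, axis=1, keepdims=True)

# Ward-linkage agglomerative clustering yields a binary merge tree:
# row i merges nodes Z[i, 0] and Z[i, 1] into new node len(classes) + i.
Z = linkage(centroids, method="ward")

# Name each internal node: average its member centroids and pick the
# most cosine-similar concept from the bank.
node_members = {i: [i] for i in range(len(classes))}
for step, (a, b, _, _) in enumerate(Z):
    node_id = len(classes) + step
    members = node_members[int(a)] + node_members[int(b)]
    node_members[node_id] = members
    proto = centroids[members].mean(axis=0)
    proto /= np.linalg.norm(proto)
    name = bank_names[int(np.argmax(bank_vecs @ proto))]
    print(f"node {node_id} = merge({int(a)}, {int(b)}) -> '{name}'")
```

With random placeholder vectors the assigned names are arbitrary; the point is the mechanics: `linkage` produces the binary tree, and the concept-bank lookup turns each unnamed internal node into a human-readable label.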

Abstract

Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our evaluation reveals systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.
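The abstract's "tree-level consistency measures" can take several forms; one simple, standard instance (an assumption here, not necessarily the paper's exact metric) is triplet agreement: for every triple of classes, each hierarchy singles out the pair that merges lowest, and the score is the fraction of triples on which the induced tree and the reference taxonomy agree. A minimal sketch, with trees represented as nested tuples of leaf names:

```python
# Illustrative sketch: triplet-based consistency between an induced
# hierarchy and a reference taxonomy, both given as nested tuples.
from itertools import combinations

def inner_pair(tree, triple):
    """Return the pair from `triple` sharing the smallest proper subtree."""
    def leaves(t):
        return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])
    best, best_size = None, float("inf")
    stack = [tree]
    while stack:
        t = stack.pop()
        if isinstance(t, str):
            continue  # leaf: no subtree to inspect
        lv = leaves(t)
        inside = [x for x in triple if x in lv]
        if len(inside) == 2 and len(lv) < best_size:
            best, best_size = frozenset(inside), len(lv)
        stack.extend([t[0], t[1]])
    return best  # None only if no subtree resolves the triple

def triplet_consistency(t1, t2, names):
    """Fraction of class triples on which both trees group the same pair."""
    agree, total = 0, 0
    for triple in combinations(names, 3):
        p1, p2 = inner_pair(t1, triple), inner_pair(t2, triple)
        if p1 is not None and p2 is not None:
            total += 1
            agree += (p1 == p2)
    return agree / total if total else 1.0

# Toy example: the induced tree splits off "truck" first, while the
# reference taxonomy pairs the two vehicles together.
induced = ((("cat", "dog"), "car"), "truck")
reference = (("cat", "dog"), ("car", "truck"))
print(triplet_consistency(induced, reference, ["cat", "dog", "car", "truck"]))  # → 0.5
```

In the toy example, both trees agree on the two triples containing the {cat, dog} pair but disagree on the two triples that probe how the vehicles are grouped, giving a score of 0.5; an induced hierarchy identical to the reference would score 1.0.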