Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
arXiv cs.LG, March 31, 2026
Key Points
- Proposes a post-hoc framework for vision-language model (VLM) embedding spaces that extracts and names semantic hierarchies from class centroids via hierarchical clustering and concept-bank matching (see the extraction sketch after this list).
- Introduces quantitative methods to verify whether the induced hierarchy aligns with human ontologies, using tree- and edge-consistency measures plus uncertainty-aware hierarchical inference to evaluate downstream utility (see the consistency sketch below).
- Presents an ontology-guided post-hoc alignment approach that learns a lightweight transformation of the embedding space, leveraging UMAP to shape target neighborhoods toward a desired hierarchy (see the alignment sketch below).
- Finds systematic modality effects across 13 pretrained VLMs and 4 datasets: image encoders tend to be more discriminative, while text encoders produce hierarchies that better match human taxonomies.
- Highlights an observed trade-off between zero-shot accuracy and ontological plausibility, offering routes for improving semantic alignment in shared image-text embeddings.
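As a rough illustration of the extraction step, the sketch below clusters class centroids agglomeratively and names each resulting group by its nearest concept-bank entry under cosine similarity. All data, dimensions, concept names, and the choice of Ward linkage are hypothetical placeholders; the paper's exact clustering and matching choices may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical inputs: L2-normalized class centroids from a VLM embedding
# space, plus a small "concept bank" of candidate node names with their
# text embeddings. Shapes and names here are illustrative only.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(10, 512))             # 10 classes, 512-d embeddings
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

concept_names = ["animal", "vehicle", "plant", "tool"]
concept_embs = rng.normal(size=(len(concept_names), 512))
concept_embs /= np.linalg.norm(concept_embs, axis=1, keepdims=True)

# Build a hierarchy over class centroids with agglomerative clustering.
Z = linkage(centroids, method="ward")

# Cut the dendrogram into k groups and name each group by the concept
# whose embedding is closest to the group's mean centroid.
k = 3
labels = fcluster(Z, t=k, criterion="maxclust")
for c in range(1, k + 1):
    group_mean = centroids[labels == c].mean(axis=0)
    group_mean /= np.linalg.norm(group_mean)
    sims = concept_embs @ group_mean               # cosine similarity
    print(f"cluster {c}: named '{concept_names[int(sims.argmax())]}'")
```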
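For the verification step, one concrete way to score agreement between an induced hierarchy and a reference ontology is a triplet-style consistency check over cophenetic distances: classes that are siblings in the reference ontology should merge in the induced tree before either merges with a non-sibling. This is an illustrative stand-in under assumed data, not necessarily the paper's exact tree/edge consistency definition.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Toy reference ontology over 6 classes: two parents, three children each.
ref_parent = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

rng = np.random.default_rng(1)
centroids = rng.normal(size=(6, 64))               # placeholder class centroids
Z = linkage(centroids, method="average")
coph = squareform(cophenet(Z))                     # induced-tree merge distances

# Triplet consistency: reference siblings should be closer in the induced
# hierarchy than either is to any non-sibling.
agree, total = 0, 0
n = len(ref_parent)
for i in range(n):
    for j in range(i + 1, n):
        if ref_parent[i] != ref_parent[j]:
            continue
        for k in range(n):
            if ref_parent[k] == ref_parent[i]:
                continue
            total += 1
            agree += coph[i, j] < min(coph[i, k], coph[j, k])
print(f"triplet consistency: {agree / total:.2f}")
```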
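For the alignment step, here is a minimal sketch assuming the lightweight transformation is a linear map trained to pull class centroids toward hierarchy-aware target positions. The paper shapes those targets with UMAP; in this placeholder, `targets` is random, and the identity regularizer is an assumed way to limit drift from zero-shot behavior, echoing the accuracy/plausibility trade-off noted above.

```python
import torch

torch.manual_seed(0)
d = 128
centroids = torch.randn(10, d)    # placeholder class centroids
targets = torch.randn(10, d)      # stand-in for UMAP-shaped target layout

# Lightweight post-hoc transformation: a single linear map over the space.
W = torch.nn.Linear(d, d, bias=False)
opt = torch.optim.Adam(W.parameters(), lr=1e-2)

for step in range(200):
    opt.zero_grad()
    aligned = W(centroids)
    # Pull transformed centroids toward hierarchy-aware targets while
    # regularizing W toward identity to preserve zero-shot behavior.
    loss = torch.nn.functional.mse_loss(aligned, targets) \
         + 0.1 * (W.weight - torch.eye(d)).pow(2).mean()
    loss.backward()
    opt.step()
print(f"final alignment loss: {loss.item():.4f}")
```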