Geometry-Aware CLIP Retrieval via Local Cross-Modal Alignment and Steering

arXiv cs.CV / 4/21/2026

Key Points

  • The paper argues that CLIP-based retrieval failures often come from local geometric inconsistencies in embedding space, which can cause systematic mis-ordering of similar items.
  • It proposes geometry-aware retrieval by reframing the task as neighborhood alignment rather than only pointwise similarity.
  • The method includes neighborhood-level re-ranking using Hungarian matching to reward structural consistency among retrieved neighbors.
  • It also introduces query-conditioned local steering, using directions extracted from contrastive neighborhoods around the query to reshape the retrieval neighborhood.
  • Experiments show gains on attribute-binding and compositional retrieval tasks; both retrieval quality and controllability improve at inference time, without retraining.
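
The re-ranking idea in the points above can be illustrated with a small sketch. The summary does not give the paper's exact formulation, so everything here is an assumption made for illustration: the function names (`matching_score`, `rerank`), the choice of comparing the query's text-side neighbors with each candidate's image-side neighbors, and the blending weight `alpha`. The only library call relied on for the matching step is SciPy's `linear_sum_assignment`, which solves the assignment problem that Hungarian matching addresses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def matching_score(query_nbrs, cand_nbrs):
    """Mean similarity under the best one-to-one matching of two neighborhoods."""
    sim = query_nbrs @ cand_nbrs.T
    # Assignment that maximizes the total matched similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


def rerank(query, text_bank, image_bank, k=5, m=3, alpha=0.5):
    """Hypothetical re-ranker: blend pointwise and neighborhood-level scores.

    query      : (d,) text embedding
    text_bank  : (n_t, d) text embeddings used to build the query neighborhood
    image_bank : (n_i, d) image embeddings to retrieve from
    alpha      : assumed weight on the structural (matching) term
    """
    query = normalize(query)
    text_bank = normalize(text_bank)
    image_bank = normalize(image_bank)

    # Step 1: ordinary pointwise retrieval of the top-k candidates.
    pointwise = image_bank @ query
    top = np.argsort(-pointwise)[:k]

    # Step 2: the query's local neighborhood on the text side.
    q_nbrs = text_bank[np.argsort(-(text_bank @ query))[:m]]

    # Step 3: score each candidate by how well its own image-side
    # neighborhood can be matched to the query neighborhood.
    scores = []
    for idx in top:
        c_nbrs = image_bank[np.argsort(-(image_bank @ image_bank[idx]))[:m]]
        structural = matching_score(q_nbrs, c_nbrs)
        scores.append((1 - alpha) * pointwise[idx] + alpha * structural)

    scores = np.asarray(scores)
    order = np.argsort(-scores)
    return top[order], scores[order]
```

With `alpha=0.0` the sketch reduces to plain cosine ranking, which makes it easy to compare orderings with and without the structural term on the same candidates.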

Abstract

CLIP retrieval is typically framed as a pointwise similarity problem in a shared embedding space. While CLIP achieves strong global cross-modal alignment, many retrieval failures arise from local geometric inconsistencies: nearby items are incorrectly ordered, which leads to systematic confusions (e.g., pentagon vs. hexagon) and produces diffuse, weakly controlled result sets. Prior work largely optimizes for pointwise relevance or relies on fine-tuning to mitigate these problems. We instead view retrieval as a problem of neighborhood alignment. Our work introduces (1) neighborhood-level re-ranking via Hungarian matching, which rewards structural consistency, and (2) query-conditioned local steering, where directions derived from contrastive neighborhoods around the query reshape retrieval. We show that these techniques improve retrieval performance on attribute-binding and compositional retrieval tasks. Both methods operate on local neighborhoods but serve different roles: re-ranking rewards alignment, whereas local steering controls neighborhood structure. These results indicate that retrieval quality and controllability depend critically on local structure, which can be exploited at inference time without retraining.
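
Query-conditioned local steering could look roughly like the sketch below. The abstract does not say how the directions are extracted, so this assumes one simple choice: the direction is the difference between the mean of the query's nearest "positive" embeddings and the mean of its nearest "negative" (contrastive) embeddings. The function name `steer`, the parameters `m` and `lam`, and the split into `pos_bank` and `neg_bank` are all hypothetical.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def steer(query, pos_bank, neg_bank, m=4, lam=1.0):
    """Shift a query along a direction taken from its own contrastive neighborhood.

    query    : (d,) query embedding
    pos_bank : (n_p, d) embeddings that carry the desired attribute (assumed given)
    neg_bank : (n_n, d) embeddings of the contrasting attribute (assumed given)
    m        : number of nearest neighbors used from each bank
    lam      : assumed steering strength; 0 leaves the query unchanged
    """
    query = normalize(query)
    pos = normalize(pos_bank)
    neg = normalize(neg_bank)

    # Keep only the members of each bank that are closest to this query,
    # so the direction is local to the query rather than global.
    pos_near = pos[np.argsort(-(pos @ query))[:m]]
    neg_near = neg[np.argsort(-(neg @ query))[:m]]

    # Contrastive direction: toward local positives, away from local negatives.
    direction = normalize(pos_near.mean(axis=0) - neg_near.mean(axis=0))

    # Move the query and project back onto the unit sphere.
    steered = normalize(query + lam * direction)
    return steered, direction


# Toy usage: the query sits between two attribute clusters, and steering
# moves it toward the positive cluster before retrieval is run.
pos_bank = np.array([[1.0, 0.0, 0.1], [1.0, 0.0, -0.1]])
neg_bank = np.array([[0.0, 1.0, 0.1], [0.0, 1.0, -0.1]])
query = np.array([1.0, 1.0, 0.0])
steered, direction = steer(query, pos_bank, neg_bank, m=2, lam=1.0)
```

The steered vector can then be used in place of the original query for ordinary cosine retrieval, which is consistent with the claim that the approach works at inference time without retraining.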