Geometry-Aware CLIP Retrieval via Local Cross-Modal Alignment and Steering
arXiv cs.CV / 4/21/2026
Key Points
- The paper argues that CLIP-based retrieval failures often come from local geometric inconsistencies in embedding space, which can cause systematic mis-ordering of similar items.
- It proposes geometry-aware retrieval by reframing the task as neighborhood alignment rather than only pointwise similarity.
- The method includes neighborhood-level re-ranking using Hungarian matching to reward structural consistency among retrieved neighbors.
- It also introduces query-conditioned local steering, using directions extracted from contrastive neighborhoods around the query to reshape the retrieval neighborhood.
- Experiments show gains on attribute-binding and compositional retrieval tasks. Both retrieval quality and controllability improve at inference time, without retraining.
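
The two inference-time ideas above, neighborhood-level re-ranking with Hungarian matching and query-conditioned local steering, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not give the scoring formula, so the blend weight `alpha`, the neighborhood size `k`, the `strength` parameter, and all function names here are hypothetical. For simplicity the sketch also treats queries and gallery items as unit vectors in one shared space, and uses a mean-difference direction between two hand-picked neighbor sets as a stand-in for the paper's contrastive neighborhoods.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def neighbors(anchor, bank, k):
    """Return the k rows of `bank` most cosine-similar to `anchor`."""
    idx = np.argsort(-(bank @ anchor))[:k]
    return bank[idx]


def matching_score(set_a, set_b):
    """Mean cosine similarity under the optimal one-to-one (Hungarian) matching."""
    sim = set_a @ set_b.T
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


def rerank(query, gallery, k=3, top_n=5, alpha=0.5):
    """Re-rank the top_n pointwise candidates by blending in a neighborhood score.

    Assumption: the structural term compares the query's neighborhood with
    each candidate's neighborhood; the paper may define it differently.
    """
    point = gallery @ query                 # pointwise cosine similarity
    cand = np.argsort(-point)[:top_n]       # initial shortlist
    q_nbrs = neighbors(query, gallery, k)
    blended = np.array([
        (1 - alpha) * point[i]
        + alpha * matching_score(q_nbrs, neighbors(gallery[i], gallery, k))
        for i in cand
    ])
    return cand[np.argsort(-blended)]


def steer(query, positives, negatives, strength=0.5):
    """Shift the query along the unit direction mean(positives) - mean(negatives)."""
    direction = positives.mean(axis=0) - negatives.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        return query                        # no usable direction; leave query as-is
    return l2_normalize(query + strength * direction / norm)


# Toy usage with random unit vectors standing in for CLIP embeddings.
rng = np.random.default_rng(0)
gallery = l2_normalize(rng.normal(size=(20, 8)))
query = l2_normalize(rng.normal(size=8))
ranked = rerank(query, gallery)                          # indices into gallery
steered = steer(query, gallery[:3], gallery[3:6])        # steered unit-norm query
ranked_after_steering = rerank(steered, gallery)
```

Both steps operate only on precomputed embeddings, which is consistent with the claim that no retraining is needed. With `alpha=0` the re-ranker reduces to plain pointwise cosine ranking, which makes the effect of the structural term easy to isolate.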