DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
arXiv cs.CV / 4/22/2026
Key Points
- The paper introduces DINO Eats CLIP (DEC), a new framework for open-set 3D object retrieval (3DOR) that pairs a DINO-based multi-view encoder with CLIP's vision-language alignment.
- While mean-pooling frozen DINO features across multi-view images performs reasonably well, fine-tuning beyond the frozen setup overfits severely to the average patterns of the known classes.
- To improve robustness, DEC adds a Chunking and Adapting Module (CAM) that splits multi-view inputs into chunks and dynamically integrates local view relationships instead of using simple pooling.
- To reduce bias toward known categories, DEC further proposes Virtual Feature Synthesis (VFS), which uses CLIP to generate virtual features for unseen classes and trains the system to use them.
- Experiments on standard open-set 3DOR benchmarks show DEC achieves stronger open-set discrimination performance than prior approaches.
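The pooling-versus-chunking contrast in the key points above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `chunk_and_pool` with its norm-based chunk weights stands in for the learned Chunking and Adapting Module, and `virtual_feature` uses simple interpolation as a stand-in for CLIP-guided Virtual Feature Synthesis.

```python
import numpy as np

def mean_pool(view_feats):
    """Baseline: average frozen DINO features across all V views."""
    return view_feats.mean(axis=0)

def chunk_and_pool(view_feats, num_chunks=3):
    """Hypothetical CAM sketch: split the V views into chunks, pool
    within each chunk, then combine chunks with softmax weights over
    their feature norms (a stand-in for the learned adapter that
    integrates local view relationships instead of flat pooling)."""
    chunks = np.array_split(view_feats, num_chunks, axis=0)
    chunk_feats = np.stack([c.mean(axis=0) for c in chunks])  # (num_chunks, D)
    scores = np.linalg.norm(chunk_feats, axis=1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return (weights[:, None] * chunk_feats).sum(axis=0)

def virtual_feature(f_a, f_b, lam=0.5):
    """Hypothetical VFS stand-in: interpolate two known-class features
    to emulate an unseen class; the paper instead derives virtual
    features with CLIP guidance."""
    return lam * f_a + (1.0 - lam) * f_b

# Toy example: 12 rendered views, 8-dim features per view.
rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 8))
global_desc = chunk_and_pool(feats)
print(global_desc.shape)  # (8,)
```

The toy weights here are fixed by feature norms purely for illustration; in DEC the view-integration weights would be learned, which is what the chunked structure is meant to regularize against overfitting to known-class averages.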