Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval
arXiv cs.CV / 4/22/2026
Key Points
- The paper proposes Diff-SBSR, the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval, a setting made hard by the absence of category supervision for unseen classes and by the sparsity of sketch inputs.
- It uses a frozen Stable Diffusion backbone to extract multimodal, discriminative features from intermediate U-Net layers for both sketch inputs and rendered 3D views, leveraging diffusion models' open-vocabulary capability and shape bias (see the feature-extraction sketch after this list).
- To mitigate the abstraction and sparsity of sketches, as well as their domain gap from natural images, without expensive retraining, the method conditions the frozen diffusion model on CLIP-derived visual features and on enriched textual guidance from BLIP, combining learnable soft prompts with hard descriptions (see the conditioning sketch below).
- It introduces a Circle-T loss that adaptively strengthens the attraction between positive sketch–3D pairs once negatives are sufficiently separated, improving alignment under sketch noise (see the loss sketch below).
- Experiments on two public benchmarks show Diff-SBSR consistently outperforms prior state-of-the-art methods for zero-shot sketch-to-3D retrieval.
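
Feature extraction from a frozen diffusion backbone typically works like this: the input (a sketch or a rendered 3D view) is encoded to latents, lightly noised at a chosen timestep, and passed through the U-Net once while a hook caches an intermediate activation. The paper's code is not given here; the minimal sketch below follows the common diffusion-feature recipe (e.g., DIFT-style extractors), and the model checkpoint, tapped block, timestep, and pooling are all illustrative assumptions.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed backbone checkpoint
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").eval()
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").eval()

features = {}
def cache_hook(_module, _inputs, output):
    # Cache one intermediate U-Net activation as the retrieval feature.
    features["feat"] = output

# Tap an up-block; which layer(s) the paper actually uses is an assumption here.
handle = unet.up_blocks[1].register_forward_hook(cache_hook)

@torch.no_grad()
def extract_feature(image, prompt, t=261):
    """image: (1, 3, H, W) float tensor in [-1, 1]; sketches and rendered views alike."""
    latents = vae.encode(image).latent_dist.mean * vae.config.scaling_factor
    timesteps = torch.tensor([t])  # assumed mid-range noising timestep
    noisy = scheduler.add_noise(latents, torch.randn_like(latents), timesteps)
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids).last_hidden_state
    unet(noisy, timesteps, encoder_hidden_states=text_emb)  # one denoising pass
    return features["feat"].mean(dim=(2, 3))  # spatially pooled descriptor
```

Because sketches and rendered 3D views pass through the same frozen network, their pooled descriptors land in one shared feature space, which is what makes the direct sketch-to-shape comparison possible.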
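The enriched conditioning can be pictured as assembling one cross-attention sequence from three parts: learnable soft-prompt vectors, the embedding of a hard BLIP-generated caption, and CLIP visual features projected into the text-embedding width. The sketch below is a hedged reconstruction, not the paper's code; the checkpoints, soft-prompt length, and projection layer are assumptions.

```python
import torch
import torch.nn as nn
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPImageProcessor, CLIPVisionModel)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").eval()
clip_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()

n_soft, dim = 8, 768               # assumed soft-prompt length; 768 = SD v1 text width
soft_prompt = nn.Parameter(torch.randn(1, n_soft, dim) * 0.02)  # learnable "soft" tokens
visual_proj = nn.Linear(1024, dim)  # assumed projection from CLIP ViT-L features

@torch.no_grad()
def hard_caption(pil_image):
    # BLIP generates the "hard" textual description of the sketch or view.
    inputs = blip_proc(pil_image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=20)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def build_condition(pil_image, encode_text):
    # encode_text: maps a caption string to (1, L, dim) CLIP text embeddings.
    hard_emb = encode_text(hard_caption(pil_image))
    pixels = clip_proc(pil_image, return_tensors="pt").pixel_values
    vis = visual_proj(clip_vision(pixels).last_hidden_state)  # (1, 257, dim)
    # Soft prompts + hard caption + projected visual tokens form the
    # cross-attention condition fed to the frozen U-Net.
    return torch.cat([soft_prompt, hard_emb, vis], dim=1)
```

Only the soft prompts and the projection layer would be trained under this reading, which is consistent with the paper's stated goal of avoiding expensive retraining of the diffusion backbone.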
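The described behavior of Circle-T (pull positives harder once negatives are already pushed away) can be approximated by adding an adaptive tightening factor on top of the standard Circle loss (Sun et al., 2020). The following is a hedged sketch of that idea, not the paper's formulation; the hyperparameters `gamma`, `margin`, and `sep` are assumptions.

```python
import torch
import torch.nn.functional as F

def circle_t_loss(sketch_emb, shape_emb, labels, gamma=32.0, margin=0.25, sep=0.5):
    """sketch_emb, shape_emb: (B, D); labels: (B,) category ids. Names assumed."""
    s = F.normalize(sketch_emb, dim=1) @ F.normalize(shape_emb, dim=1).t()  # cosine sims
    pos_mask = labels.unsqueeze(1).eq(labels.unsqueeze(0))
    neg_mask = ~pos_mask

    # Standard Circle-loss adaptive weights and margins (Sun et al., 2020).
    alpha_p = torch.clamp_min(1 + margin - s, 0.0)
    alpha_n = torch.clamp_min(s + margin, 0.0)
    delta_p, delta_n = 1 - margin, margin
    logit_p = -gamma * alpha_p * (s - delta_p)
    logit_n = gamma * alpha_n * (s - delta_n)

    # Assumed tightening term: if an anchor's hardest negative already sits
    # below `sep`, double the weight on the positive pull.
    hardest_neg = s.masked_fill(pos_mask, -1.0).max(dim=1).values
    tighten = 1.0 + (hardest_neg < sep).float()  # 2x when well separated
    logit_p = logit_p * tighten.unsqueeze(1)

    lse_p = torch.logsumexp(logit_p.masked_fill(neg_mask, float("-inf")), dim=1)
    lse_n = torch.logsumexp(logit_n.masked_fill(pos_mask, float("-inf")), dim=1)
    return F.softplus(lse_p + lse_n).mean()  # log(1 + e^{...}), batch-averaged
```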


