FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder
arXiv cs.CV / 4/6/2026
Key Points
- The paper introduces FusionBERT, a multi-view image-to-3D retrieval framework designed to improve cross-modal matching beyond single-view image–3D alignment.
- FusionBERT includes a cross-attention-based multi-view visual aggregator that adaptively fuses complementary information across multiple image viewpoints to produce a more robust fused visual feature.
- It also proposes a normal-aware 3D encoder that jointly models point normals and 3D positions to strengthen geometric representation, particularly for textureless or color-degraded 3D models.
- Experiments on image–3D retrieval show significantly higher accuracy than state-of-the-art large multimodal models in both single-view and multi-view settings, positioning FusionBERT as a strong new baseline.
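The cross-attention-based multi-view aggregator described above can be illustrated with a minimal sketch: a query vector attends over per-view feature vectors, and the fused visual feature is the attention-weighted sum of the views. The function names and the single-query, dot-product-attention formulation here are illustrative assumptions, not the paper's exact architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_views(query, view_feats):
    """Cross-attention-style fusion (illustrative sketch):
    score each view feature against a query via scaled dot product,
    then return the attention-weighted sum of the view features."""
    d = len(query)
    scores = [sum(q * v for q, v in zip(query, feat)) / math.sqrt(d)
              for feat in view_feats]
    weights = softmax(scores)
    fused = [sum(w * feat[i] for w, feat in zip(weights, view_feats))
             for i in range(d)]
    return fused, weights

# Two hypothetical 2-D view features; the view aligned with the
# query receives the larger attention weight.
query = [1.0, 0.0]
views = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_views(query, views)
```

In a real model the query, keys, and values would be learned projections over high-dimensional view embeddings; the adaptive weighting shown here is what lets complementary viewpoints contribute unequally to the fused feature.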