GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models
arXiv cs.CV · March 18, 2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- GAP-MLLM proposes Geometry-Aligned Pre-training to activate 3D geometric representations in multimodal LLMs, addressing limitations of image-only inputs.
- The authors argue the remaining gap in 3D perception is due to misalignment in the training paradigm, not a lack of geometric priors.
- It introduces a visual-prompted joint task that requires MLLMs to predict sparse pointmaps alongside semantic labels, enforcing geometric awareness.
- It includes a multi-level progressive fusion module with token-level gating to adaptively fuse geometric priors without suppressing semantic reasoning.
- Experiments show improved geometric feature fusion and performance gains across 3D visual grounding, 3D dense captioning, and 3D video object detection.
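The token-level gating described in the fusion bullet can be illustrated with a small sketch. The paper's exact formulation is not given here, so all names, shapes, and the specific gate form (a sigmoid over the concatenated semantic and geometric tokens, applied as a residual update) are hypothetical; the sketch only shows the general idea of injecting geometric priors per token without overriding the semantic stream.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_gated_fusion(sem_tokens, geo_tokens, W, b):
    """Fuse geometric tokens into semantic tokens with a per-token gate.

    sem_tokens, geo_tokens: (num_tokens, dim) arrays.
    W: (2*dim, 1) gate projection; b: scalar bias. These parameters are
    hypothetical stand-ins, not the paper's actual module.
    """
    # Per-token scalar gate in (0, 1), computed from both streams.
    gate = sigmoid(np.concatenate([sem_tokens, geo_tokens], axis=-1) @ W + b)
    # Residual fusion: geometry is injected only where the gate opens,
    # so semantic reasoning is not suppressed when the gate stays near 0.
    return sem_tokens + gate * geo_tokens

rng = np.random.default_rng(0)
num_tokens, dim = 4, 8
sem = rng.standard_normal((num_tokens, dim))
geo = rng.standard_normal((num_tokens, dim))
W = rng.standard_normal((2 * dim, 1)) * 0.1
fused = token_gated_fusion(sem, geo, W, 0.0)
print(fused.shape)  # (4, 8)
```

With the gate projection zeroed out, the sigmoid outputs 0.5 everywhere and the fusion reduces to `sem + 0.5 * geo`, which makes the residual structure easy to verify.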