GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models
arXiv cs.CV / 3/18/2026
Key Points
- GAP-MLLM proposes Geometry-Aligned Pre-training to activate 3D geometric representations in multimodal LLMs, addressing limitations of image-only inputs.
- The authors argue the remaining gap in 3D perception is due to misalignment in the training paradigm, not a lack of geometric priors.
- It introduces a visual-prompted joint task forcing MLLMs to predict sparse pointmaps alongside semantic labels to enforce geometric awareness.
- It includes a multi-level progressive fusion module with token-level gating to adaptively fuse geometric priors without suppressing semantic reasoning.
- Experiments report more effective geometric feature fusion and consistent performance gains across 3D visual grounding, 3D dense captioning, and 3D video object detection.
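The paper's fusion code is not reproduced here, but the token-level gating idea can be illustrated with a minimal sketch: a learned sigmoid gate decides, per token, how much of the geometric feature to add to the semantic feature, so geometric priors are injected without overwriting semantic content. All function names, shapes, and the residual form below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def token_gated_fusion(sem, geo, W, b):
    """Fuse geometric tokens into semantic tokens via a per-token gate.

    sem, geo: (T, D) semantic / geometric token features.
    W: (2D, D) gate projection, b: (D,) bias (hypothetical parameters).
    The sigmoid gate lies in [0, 1], so the fused output never moves
    farther from `sem` than the geometric feature itself.
    """
    z = np.concatenate([sem, geo], axis=-1) @ W + b  # (T, D) gate logits
    gate = 1.0 / (1.0 + np.exp(-z))                  # sigmoid gate in [0, 1]
    return sem + gate * geo                          # residual, gated fusion

rng = np.random.default_rng(0)
T, D = 4, 8
sem = rng.standard_normal((T, D))
geo = rng.standard_normal((T, D))
W = rng.standard_normal((2 * D, D)) * 0.1
b = np.zeros(D)

fused = token_gated_fusion(sem, geo, W, b)
print(fused.shape)
```

Because the gate is bounded, a closed (near-zero) gate leaves the semantic token essentially untouched, which matches the stated goal of fusing geometry "without suppressing semantic reasoning."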