SpatialPoint: Spatial-aware Point Prediction for Embodied Localization
arXiv cs.AI / March 31, 2026
Key Points
- The paper introduces “embodied localization,” defined as predicting executable 3D points from visual observations plus language instructions for embodied agents acting in 3D space.
- It distinguishes two target types for the task: touchable (surface-grounded) 3D points for physical interaction, and air (free-space) 3D points for placement, navigation, and geometric/directional constraints.
- SpatialPoint is proposed as a spatial-aware vision-language framework that explicitly integrates structured depth into a VLM and outputs 3D coordinates in the camera frame (see the back-projection sketch after this list), rather than relying on implicit geometric reconstruction from RGB alone.
- The authors build a 2.6M-sample RGB-D dataset of QA pairs covering both touchable and air points for training and evaluation.
- Experiments and real-robot deployment across grasping, object placement, and mobile navigation show that incorporating depth into VLMs significantly improves embodied localization performance.
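The digest doesn't spell out how "camera-frame 3D coordinates" relate to the model's visual inputs, so the following sketch grounds the term with standard pinhole back-projection: a pixel location plus a metric depth value and camera intrinsics determine a unique point in the camera frame. The `backproject` helper and the intrinsics values are hypothetical illustrations, not the paper's code.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth z (meters) to a camera-frame
    3D point via the pinhole model: x = (u - cx) * z / fx, and so on.
    Hypothetical helper for illustration, not SpatialPoint's code."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example: a predicted "touchable" point at pixel (320, 240) with 0.85 m
# of sensed depth, using made-up intrinsics for a 640x480 camera.
point_cam = backproject(320.0, 240.0, 0.85, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(point_cam)  # [0.   0.   0.85]
```

On this reading, a depth-integrated VLM emits such (x, y, z) triples directly: touchable points land on sensed surfaces, while air points are free-space coordinates with no supporting surface depth.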