Ψ-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer

arXiv cs.RO / 4/14/2026


Key Points

  • The paper introduces Ψ-Map, a framework for open-vocabulary panoptic reconstruction aimed at enabling more reliable real-to-sim transfer for robotics perception and simulation.
  • It uses LiDAR-driven plane-constrained multimodal Gaussian Mixture Models and 2D Gaussian surfels to improve surface alignment and provide continuous geometric supervision in large-scale environments.
  • To reduce error accumulation from multi-stage pipelines, it proposes a query-guided end-to-end architecture that lifts 2D mask features into 3D via local cross-attention within the view frustum for globally consistent panoptic understanding.
  • It improves real-time performance by optimizing rendering with Precise Tile Intersection and a Top-K hard selection strategy for semantic computation.
  • Experiments report better geometric and panoptic reconstruction quality while sustaining inference above 40 FPS, meeting the real-time needs of robotic control loops.
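The Top-K hard selection idea can be illustrated with a minimal sketch: instead of alpha-blending high-dimensional semantic features over every Gaussian covering a pixel, keep only the K Gaussians with the largest blending weights and renormalize. This is a hedged reconstruction of the general strategy, not the paper's implementation; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def topk_hard_select(weights, features, k=3):
    """Blend semantic features over only the top-K Gaussians per pixel.

    weights:  (P, N) per-pixel alpha-blending weights over N Gaussians
    features: (N, D) high-dimensional semantic feature per Gaussian
    Returns:  (P, D) blended semantics using K terms instead of N.
    """
    # Indices of the K largest weights per pixel (unordered within the K).
    idx = np.argpartition(weights, -k, axis=1)[:, -k:]          # (P, k)
    w = np.take_along_axis(weights, idx, axis=1)                # (P, k)
    # Renormalize so the truncated weights still sum to 1 per pixel.
    w = w / np.clip(w.sum(axis=1, keepdims=True), 1e-8, None)
    # Gather the K feature vectors per pixel and blend them.
    return np.einsum('pk,pkd->pd', w, features[idx])            # (P, D)
```

The payoff is that the expensive D-dimensional blend scales with K rather than with the number of overlapping Gaussians, which is what makes high-dimensional semantic rendering tractable at frame rate.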

Abstract

Open-vocabulary panoptic reconstruction is essential for advanced robotics perception and simulation. However, existing methods based on 3D Gaussian Splatting (3DGS) often struggle to simultaneously achieve geometric accuracy, coherent panoptic understanding, and real-time inference frequency in large-scale scenes. In this paper, we propose a comprehensive framework that integrates geometric reinforcement, end-to-end panoptic learning, and efficient rendering. First, to ensure physical realism in large-scale environments, we leverage LiDAR data to construct plane-constrained multimodal Gaussian Mixture Models (GMMs) and employ 2D Gaussian surfels as the map representation, enabling high-precision surface alignment and continuous geometric supervision. Building upon this, to overcome the error accumulation and cumbersome cross-frame association inherent in traditional multi-stage panoptic segmentation pipelines, we design a query-guided end-to-end learning architecture. By utilizing a local cross-attention mechanism within the view frustum, the system lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding. Finally, addressing the computational bottlenecks caused by high-dimensional semantic features, we introduce Precise Tile Intersection and a Top-K Hard Selection strategy to optimize the rendering pipeline. Experimental results demonstrate that our system achieves superior geometric and panoptic reconstruction quality in large-scale scenes while maintaining an inference rate exceeding 40 FPS, meeting the real-time requirements of robotic control loops.
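The query-guided lifting step described above can be sketched as a local cross-attention: each 3D Gaussian acts as a query that attends only to the 2D mask features visible within its view frustum, pulling per-frame mask embeddings into a globally shared 3D representation. The following is a minimal, assumed-shape sketch of that mechanism, not the paper's architecture; it presumes every Gaussian sees at least one mask in the frame.

```python
import numpy as np

def lift_masks_to_3d(gauss_feat, mask_feat, in_frustum):
    """Local cross-attention from 3D Gaussian queries to 2D mask features.

    gauss_feat: (G, D) query embeddings attached to 3D Gaussians
    mask_feat:  (M, D) 2D mask embeddings from the current frame
    in_frustum: (G, M) boolean, True where mask m is visible to Gaussian g
                (assumes each row has at least one True entry)
    Returns:    (G, D) lifted 3D features.
    """
    D = gauss_feat.shape[1]
    # Scaled dot-product attention scores.
    logits = gauss_feat @ mask_feat.T / np.sqrt(D)    # (G, M)
    # Restrict attention to masks inside the local view frustum.
    logits = np.where(in_frustum, logits, -np.inf)
    # Numerically stable softmax over the visible masks.
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    # Weighted sum of 2D mask features gives the lifted 3D feature.
    return attn @ mask_feat                           # (G, D)
```

Because the attention is computed per frame but accumulated on persistent 3D Gaussians, the cross-frame association that multi-stage pipelines handle heuristically falls out of the shared 3D queries, which is the stated motivation for the end-to-end design.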