INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public Safety

arXiv cs.CV / 4/28/2026


Key Points

  • The paper proposes INSIGHT, a pipeline that gives public-safety indoor scenes machine-readable “spatial intelligence” by transferring 2D image understanding into 3D metric space via registered RGB-D data (see the sketch after this list).
  • It addresses two key challenges in prior public-safety 3D segmentation work: limited labeled indoor training data and weak recognition of small, safety-critical features in native point-cloud methods.
  • INSIGHT uses interchangeable 2D vision stacks—one built on the SAM3 foundation model for text-prompted segmentation, the other composed of traditional CV components (open-set detection, VQA, OCR)—that share a common 3D back end.
  • The method is evaluated on all seven subareas of Stanford 2D-3D-S, producing Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with roughly 10,000× compression, compact enough for field deployment.
  • Results report per-point labeling accuracy on shared classes, detection sensitivity for 15 safety-critical classes not present in public 3D benchmarks, and complementary behavior between the two pipelines.
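
The 2D-to-3D transfer in the first bullet is, at its core, a standard RGB-D back-projection: each labeled pixel is lifted through the depth map into metric space using the camera intrinsics and pose that registered datasets like Stanford 2D-3D-S provide. Below is a minimal sketch of that step, assuming a pinhole camera model and per-frame world poses; all function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def backproject_labels(depth, labels, K, cam_to_world):
    """Lift a per-pixel label map into a labeled 3D point cloud.

    depth        : (H, W) metric depth in meters (0 = invalid)
    labels       : (H, W) integer class IDs from any 2D vision stack
    K            : (3, 3) camera intrinsics
    cam_to_world : (4, 4) camera pose from the RGB-D registration
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0

    # Pinhole back-projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)

    # Transform into the shared world frame so labels from many views
    # accumulate on the same metric point cloud.
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, labels[valid]
```

Because the 2D stack only appears as the `labels` input here, either vision pipeline (SAM3 or the traditional CV stack) can feed the same 3D back end, which is what makes the stacks interchangeable.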

Abstract

Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with ~10,000× compression; role-filtered payloads transmit in under 15 s at 1 Mbps over FirstNet Band 14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.
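
The transmission claim is a straightforward bandwidth budget: at 1 Mbps, 15 seconds moves about 1.9 MB, so any role-filtered scene graph under that size fits. A minimal back-of-the-envelope check follows; the raw scan size is an illustrative assumption, not a figure from the paper, while the compression factor and link rate are the ones quoted above.

```python
# Back-of-the-envelope check of the field-deployment claim.
raw_scan_bytes = 2_000_000_000          # assumed ~2 GB raw RGB-D scan (illustrative)
compression = 10_000                    # the paper's ~10,000x figure
scene_graph_bytes = raw_scan_bytes / compression   # -> ~200 KB

link_bps = 1_000_000                    # FirstNet Band 14 at 1 Mbps
seconds = scene_graph_bytes * 8 / link_bps
print(f"{scene_graph_bytes / 1e3:.0f} KB -> {seconds:.1f} s")  # 200 KB -> 1.6 s
```

Under these assumptions a full-building scene graph transmits in well under the 15 s budget, leaving headroom for larger scans or slower effective throughput.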