Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

arXiv cs.CV / 4/10/2026

Key Points

  • The paper introduces FI3Det, a framework for few-shot incremental 3D object detection that targets dynamic indoor environments where new object classes appear over time.
  • FI3Det uses vision-language models (VLMs) in a base stage to mine “unknown” objects and learn representations for unseen categories, including 2D semantic features and class-agnostic 3D bounding boxes.
  • To reduce noise in the mined representations, a weighting mechanism re-weights the contributions of point- and box-level features according to their spatial locations and their feature consistency within each box.
  • For classification, FI3Det proposes a gated multimodal prototype imprinting module: category prototypes are built from aligned 2D semantic and 3D geometric features, each modality produces classification scores against its prototypes, and a multimodal gate fuses those scores to detect novel objects.
  • Experiments with batch and sequential evaluation on ScanNet V2 and SUN RGB-D show consistent improvements over baseline methods, and the authors provide code on GitHub.
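
The summary does not give the exact weighting formula, so the following is only a rough sketch of what re-weighting by spatial location and feature consistency could look like inside one mined box. Everything in it is an assumption made for illustration, not the authors' method: the function names `point_weights` and `box_weight`, the Gaussian falloff from the box center, the `sigma` parameter, and the use of cosine similarity to the mean box feature.

```python
import math


def point_weights(points, feats, center, size, sigma=0.5):
    """Hypothetical per-point weights for the points inside one mined box.

    Assumes at least one point. Each weight is the product of two terms:
    - spatial: Gaussian falloff of the size-normalized offset from the center
    - consistency: cosine similarity to the mean box feature, clipped at 0
    """
    dim = len(feats[0])
    mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na > 0 and nb > 0 else 0.0

    weights = []
    for p, f in zip(points, feats):
        # Each axis offset is divided by the half-extent of the box,
        # so a value of 1.0 on an axis corresponds to the box face.
        r2 = sum(((pi - ci) / (si / 2)) ** 2
                 for pi, ci, si in zip(p, center, size))
        spatial = math.exp(-r2 / (2 * sigma ** 2))
        consistency = max(0.0, cosine(f, mean))
        weights.append(spatial * consistency)
    return weights


def box_weight(weights):
    """Hypothetical box-level weight: the mean of its point weights."""
    return sum(weights) / len(weights) if weights else 0.0


# Three points in a unit box: one central, one on a face, one feature outlier.
w = point_weights(
    points=[(0, 0, 0), (0.5, 0, 0), (0, 0, 0)],
    feats=[[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]],
    center=(0, 0, 0),
    size=(1, 1, 1),
)
```

In this toy case the central point with a consistent feature gets the largest weight, the point on the box face is down-weighted by distance, and the point whose feature disagrees with the rest of the box is suppressed to zero.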

Abstract

Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. Because FI3Det is the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.
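
To make the prototype imprinting and gating idea concrete, here is a minimal sketch under stated assumptions. It uses the common form of imprinting (a class prototype is the normalized mean of its normalized few-shot features, and scores are cosine similarities) and a single scalar sigmoid gate. The names `imprint`, `scores`, `fuse`, and `gate_logit` are hypothetical, and the gate in the actual framework may be computed quite differently; the linked repository is the place to check.

```python
import math


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def imprint(shots):
    """Prototype = L2-normalized mean of the L2-normalized few-shot features."""
    normed = [l2_normalize(s) for s in shots]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return l2_normalize(mean)


def scores(feat, prototypes):
    """Cosine similarity between one query feature and each class prototype."""
    f = l2_normalize(feat)
    return {c: sum(a * b for a, b in zip(f, p)) for c, p in prototypes.items()}


def fuse(s2d, s3d, gate_logit=0.0):
    """Assumed gate: g = sigmoid(gate_logit) weights 2D against 3D scores."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return {c: g * s2d[c] + (1.0 - g) * s3d[c] for c in s2d}


# Toy few-shot features per modality (made-up numbers, for illustration only).
protos_2d = {"chair": imprint([[1.0, 0.0], [0.9, 0.1]]),
             "lamp": imprint([[0.0, 1.0]])}
protos_3d = {"chair": imprint([[1.0, 0.0, 0.0]]),
             "lamp": imprint([[0.0, 0.0, 1.0]])}

s2d = scores([1.0, 0.2], protos_2d)
s3d = scores([0.8, 0.1, 0.3], protos_3d)
fused = fuse(s2d, s3d)
predicted = max(fused, key=fused.get)
```

Because a prototype is computed directly from a handful of samples, adding a novel class in this sketch means adding one entry to each prototype dictionary, with no retraining step; the gate then decides how much to trust the 2D semantic score relative to the 3D geometric one.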