VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

arXiv cs.CV / 5/6/2026

Key Points

  • The paper introduces VL-SAM-v3, a unified approach to open-world object detection that works for both open-vocabulary and open-ended settings.
  • Instead of relying mainly on coarse text semantics and parametric knowledge, VL-SAM-v3 retrieves external visual prototypes from a non-parametric memory bank to build more reliable visual priors (a minimal retrieval sketch follows this list).
  • It transforms retrieved prototypes into two complementary priors: sparse priors for instance-level spatial anchoring and dense priors for class-aware local context.
  • The method integrates these priors into detection through Memory-Guided Prompt Refinement, using a shared retrieval-and-refinement mechanism during inference.
  • Zero-shot experiments on LVIS show consistent improvements in detection, with especially large gains for rare categories, and results with a stronger open-vocabulary detector (SAM3) confirm the generality of the retrieval-and-refinement design.
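
The retrieval step described above can be pictured as a nearest-neighbor lookup over stored exemplar embeddings. The following is a minimal sketch assuming an L2-normalized embedding bank scored by cosine similarity; the function names, shapes, and embedding dimension are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of non-parametric prototype retrieval. The memory bank is
# assumed to store L2-normalized visual embeddings per category; everything
# here is a stand-in, not VL-SAM-v3's actual implementation.
import torch
import torch.nn.functional as F


def build_memory_bank(features: dict[str, torch.Tensor]) -> tuple[torch.Tensor, list[str]]:
    """Stack per-category exemplar embeddings into one searchable matrix."""
    names, rows = [], []
    for category, embs in features.items():
        rows.append(F.normalize(embs, dim=-1))   # unit-norm rows for cosine scoring
        names.extend([category] * embs.shape[0])
    return torch.cat(rows, dim=0), names         # (N, D) bank plus parallel labels


def retrieve_prototypes(query: torch.Tensor, bank: torch.Tensor,
                        labels: list[str], k: int = 5) -> list[tuple[str, float]]:
    """Return the top-k most similar stored prototypes for one query embedding."""
    query = F.normalize(query, dim=-1)
    sims = bank @ query                          # cosine similarity against every row
    scores, idx = sims.topk(k)
    return [(labels[i], float(s)) for i, s in zip(idx.tolist(), scores.tolist())]


# Usage with random stand-in features (D = 256 is an assumption):
bank, labels = build_memory_bank({
    "zebra": torch.randn(8, 256),
    "unicycle": torch.randn(3, 256),             # rare category with few exemplars
})
print(retrieve_prototypes(torch.randn(256), bank, labels, k=3))
```

Because the bank is non-parametric, rare categories can be supported by simply adding a few exemplar embeddings, with no retraining, which is consistent with the strong rare-category gains the paper reports.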

Abstract

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two settings, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports both open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.
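
To make the two priors concrete, the sketch below shows one plausible reading of the abstract: a dense prior as a per-pixel prototype-similarity map, a sparse prior as the top-responding point anchors extracted from that map, and prompt refinement as residual cross-attention from prompts to prior tokens. Correlation and cross-attention are stand-in operators chosen for illustration; the paper's actual modules may differ.

```python
# Hedged sketch of prior construction and prompt refinement; all shapes,
# module choices, and hyperparameters are assumptions for illustration.
import torch
import torch.nn.functional as F


def dense_prior(image_feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Class-aware local context as a per-pixel cosine-similarity map.

    image_feats: (D, H, W) backbone features; prototypes: (K, D).
    Returns a (K, H, W) map of local class evidence.
    """
    d, h, w = image_feats.shape
    flat = F.normalize(image_feats.reshape(d, -1), dim=0)     # (D, H*W), unit-norm pixels
    protos = F.normalize(prototypes, dim=-1)                  # (K, D)
    return (protos @ flat).reshape(-1, h, w)


def sparse_prior(sim_map: torch.Tensor, num_anchors: int = 4) -> torch.Tensor:
    """Instance-level spatial anchoring: strongest response locations per class."""
    k, _, w = sim_map.shape
    _, idx = sim_map.reshape(k, -1).topk(num_anchors, dim=-1)  # (K, A) flat indices
    ys, xs = idx // w, idx % w
    return torch.stack([xs, ys], dim=-1).float()               # (K, A, 2) point anchors


class PromptRefiner(torch.nn.Module):
    """One residual cross-attention step: prompts attend to prior tokens."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, prompts: torch.Tensor, prior_tokens: torch.Tensor) -> torch.Tensor:
        refined, _ = self.attn(prompts, prior_tokens, prior_tokens)
        return prompts + refined   # refine rather than replace the original prompts


# Usage with stand-in tensors (all shapes are assumptions):
feats = torch.randn(256, 32, 32)                 # backbone feature map
protos = torch.randn(5, 256)                     # retrieved prototypes, K = 5
dmap = dense_prior(feats, protos)                # (5, 32, 32)
anchors = sparse_prior(dmap)                     # (5, 4, 2)
refined = PromptRefiner()(torch.randn(1, 10, 256), protos.unsqueeze(0))
```

In the pipeline the abstract describes, the refined prompts would then drive the downstream detector head under either inference setting; the residual cross-attention above only illustrates how external visual evidence could be fused with the original prompts.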