Towards Visual Query Localization in the 3D World

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces 3DVQL, the first benchmark targeting visual query localization (VQL) in 3D: given a visual query, the system must localize the most recent occurrence of the queried object in a sequence, both in time and in 3D space.
  • 3DVQL comprises 2,002 sequences totaling about 170,000 frames, annotated with roughly 6,400 response track segments spanning 38 object categories, and provides multiple input modalities (point clouds, RGB images, and depth images).
  • All annotations are produced manually and pass through multiple rounds of verification and refinement to improve label quality.
  • The authors implement a set of representative 3D multimodal VQL baselines using point clouds and RGB images, and find that performance varies substantially with the choice of fusion module.
  • They propose a lift-and-attention fusion algorithm, LaF, which significantly outperforms the existing baselines (a hypothetical sketch of such a fusion block follows this list), and plan to publicly release the benchmark and code.
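
The summary above does not spell out how LaF works internally, so the following is only a rough sketch of what a lift-and-attention fusion block could look like: 2D image features are "lifted" onto the 3D points by projecting each point through the camera intrinsics and sampling the feature map, and the lifted features are then fused with point features via cross-attention. The class name, signature, and projection details below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiftAndAttendFusion(nn.Module):
    """Hypothetical lift-and-attention fusion block (not the paper's LaF)."""

    def __init__(self, point_dim: int, image_dim: int, fused_dim: int, num_heads: int = 4):
        super().__init__()
        self.point_proj = nn.Linear(point_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)

    def lift(self, points_xyz, image_feats, intrinsics):
        """Sample a per-point image feature by pinhole projection.

        points_xyz:  (B, N, 3) points in the camera frame (z > 0).
        image_feats: (B, C, H, W) feature map from an RGB backbone.
        intrinsics:  (B, 3, 3) camera matrix K.
        """
        _, _, H, W = image_feats.shape
        uvw = torch.einsum("bij,bnj->bni", intrinsics, points_xyz)
        uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)   # (B, N, 2) pixel coords
        # Normalize to [-1, 1] for grid_sample; out-of-view points sample zeros.
        grid = torch.stack(
            [uv[..., 0] / (W - 1) * 2 - 1, uv[..., 1] / (H - 1) * 2 - 1], dim=-1
        ).unsqueeze(1)                                     # (B, 1, N, 2)
        sampled = F.grid_sample(image_feats, grid, align_corners=True)
        return sampled.squeeze(2).transpose(1, 2)          # (B, N, C)

    def forward(self, point_feats, points_xyz, image_feats, intrinsics):
        q = self.point_proj(point_feats)                   # point features as queries
        kv = self.image_proj(self.lift(points_xyz, image_feats, intrinsics))
        fused, _ = self.cross_attn(q, kv, kv)              # attend over lifted features
        return self.norm(q + fused)                        # residual fusion
```

The fused per-point features could then feed a 3D localization head; for instance, `LiftAndAttendFusion(point_dim=128, image_dim=256, fused_dim=256)` would combine PointNet-style point features with a ResNet feature map.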

Abstract

Visual query localization (VQL) aims to predict, given a query, the spatio-temporal response of its most recent occurrence in a sequence. To date, most research has focused on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to address visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided in multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using point clouds and RGB images. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift-and-attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released at https://github.com/wuhengliangliang/3DVQL.
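
The abstract defines the task only abstractly; to make the expected input and output concrete, here is a minimal sketch of what a 3D VQL sample could look like, assuming the query is a 2D crop identifying the target object and the response is a temporal window with one 3D box per frame. All field names are hypothetical and are not taken from the released benchmark.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VisualQuery:
    """Hypothetical query: a crop identifying the object to localize."""
    sequence_id: str
    query_frame: int                             # frame the crop is taken from
    crop_2d: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
    category: str                                # one of the 38 object categories

@dataclass
class ResponseTrack:
    """Hypothetical answer: the most recent occurrence of the target."""
    start_frame: int
    end_frame: int
    # One 3D box per frame, e.g. (cx, cy, cz, dx, dy, dz, yaw) in world coordinates.
    boxes_3d: List[Tuple[float, ...]]
```

Under this reading, a model consumes a `VisualQuery` together with the sequence's point clouds, RGB images, and depth images, and outputs a `ResponseTrack` covering the queried object's most recent appearance.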