PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

arXiv cs.CV / April 13, 2026


Key Points

  • PinpointQA is presented as the first dataset and benchmark specifically targeting small object-centric spatial understanding in indoor videos, focused on precise target localization and positional description.
  • The benchmark includes 1,024 scenes and 10,094 QA pairs derived from ScanNet++ and ScanNet200, structured into four progressively harder tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP).
  • QA pairs are generated automatically from intermediate spatial representations and then refined through quality control to improve reliability for evaluation.
  • Experiments with representative multimodal LLMs show a consistent performance gap across the task progression, with Structured Spatial Prediction (SSP) proving especially challenging.
  • Supervised fine-tuning on PinpointQA delivers substantial gains, indicating the dataset is useful both as a diagnostic benchmark and as training data for improving downstream spatial reasoning.
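To make the four-task structure concrete, the following is a minimal, hypothetical sketch of how a PinpointQA-style record might be represented and grouped by task for progressive evaluation. The field names and `QAPair`/`by_task` helpers are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass

# The four progressive tasks described in the paper:
# Target Presence Verification, Nearest Reference Identification,
# Fine-Grained Spatial Description, Structured Spatial Prediction.
TASKS = ("TPV", "NRI", "FSD", "SSP")


@dataclass
class QAPair:
    """Hypothetical QA record; field names are assumptions for illustration."""
    scene_id: str   # e.g. a ScanNet++ or ScanNet200 scene identifier
    task: str       # one of TASKS
    question: str
    answer: str

    def __post_init__(self):
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")


def by_task(pairs):
    """Bucket QA pairs by task, mirroring the benchmark's progressive chain."""
    buckets = {t: [] for t in TASKS}
    for p in pairs:
        buckets[p.task].append(p)
    return buckets


pairs = [
    QAPair("scene_0001", "TPV", "Is there a remote control in this room?", "Yes"),
    QAPair("scene_0001", "SSP", "Give the structured position of the remote.", "on the sofa, left armrest"),
]
grouped = by_task(pairs)
```

Scoring each bucket separately would surface the capability gap the authors report along the TPV → NRI → FSD → SSP progression.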

Abstract

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, none directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.