PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

arXiv cs.CV / 4/8/2026


Key Points

  • PanopticQuery proposes a framework for unified semantic grounding across space, time, and viewpoints when querying dynamic 4D scenes with natural language.
  • Building on high-fidelity dynamic reconstruction with 4D Gaussian Splatting, it forms a consensus over 2D semantic predictions from multiple views and time steps, filtering out inconsistent outputs and lifting the 2D semantics into structured 4D groundings while preserving geometric consistency (see the sketch after this list).
  • This is aimed at handling complex semantics beyond simple attributes, including temporal actions, spatial relations, and multi-object interactions.
  • For evaluation, the authors introduce a new benchmark, Panoptic-L4D, and report state-of-the-art results that surpass prior methods on complex language queries.
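
To make the consensus idea concrete, here is a minimal sketch in Python of one way such a multi-view aggregation step could work. The paper's actual mechanism is not detailed in this summary, so a simple majority-vote filter over per-view class probabilities stands in for it; the function name `consensus_filter` and the agreement threshold `tau` are illustrative assumptions, not from the paper.

```python
# Hypothetical multi-view semantic consensus step (not the paper's actual
# algorithm): each 3D point carries per-view, per-frame class-probability
# predictions; views that disagree with the majority label are filtered
# out before averaging, and low-agreement points are marked uncertain.
import numpy as np

def consensus_filter(view_probs: np.ndarray, tau: float = 0.6) -> np.ndarray:
    """Aggregate per-view class probabilities for one 3D point.

    view_probs: (V, C) array -- V view/frame predictions over C classes.
    Returns a (C,) consensus distribution, or uniform if agreement < tau.
    """
    n_classes = view_probs.shape[1]
    votes = view_probs.argmax(axis=1)                     # hard label per view
    majority = np.bincount(votes, minlength=n_classes).argmax()
    agreement = (votes == majority).mean()                # fraction agreeing
    if agreement < tau:                                   # inconsistent point
        return np.full(n_classes, 1.0 / n_classes)        # treat as uncertain
    keep = votes == majority                              # drop outlier views
    return view_probs[keep].mean(axis=0)

# Toy example: 4 views, 3 classes; the dissenting third view is filtered.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.9, 0.05, 0.05]])
print(consensus_filter(probs))  # consensus leans strongly toward class 0
```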

Abstract

Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.
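
As a rough illustration of the "lifts 2D semantics into structured 4D groundings via neural field optimization" step, the PyTorch sketch below fits learnable per-Gaussian class logits so that soft blending reproduces consensus-filtered 2D semantic maps. The blending weights here are random stand-ins for a renderer's per-pixel alpha-blending weights, and all names, shapes, and the KL loss are assumptions for illustration, not details from the paper.

```python
# Hypothetical lifting of 2D semantics onto a Gaussian representation by
# optimization: per-Gaussian semantic logits are fit so that blending them
# with (given) per-pixel weights matches the 2D consensus labels.
import torch

G, P, C = 512, 1024, 8           # Gaussians, pixels (all views/frames), classes
weights = torch.rand(P, G)       # stand-in for per-pixel blending weights
weights = weights / weights.sum(dim=1, keepdim=True)
target = torch.softmax(torch.randn(P, C), dim=1)  # stand-in 2D consensus maps

logits = torch.zeros(G, C, requires_grad=True)    # learnable per-Gaussian semantics
opt = torch.optim.Adam([logits], lr=0.05)

for step in range(200):
    opt.zero_grad()
    # blend per-Gaussian class distributions into per-pixel predictions
    pred = weights @ torch.softmax(logits, dim=1)         # (P, C)
    loss = torch.nn.functional.kl_div(pred.clamp_min(1e-8).log(), target,
                                      reduction="batchmean")
    loss.backward()
    opt.step()
```

After convergence, each Gaussian carries a class distribution that is consistent with the 2D evidence across views and frames, which is the structured 4D grounding a language query would then be matched against.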