Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding

arXiv cs.CV / 4/3/2026


Key Points

  • The paper introduces “Director,” a unified spatio-temporal Gaussian representation designed for dynamic scenes, aiming to combine high-fidelity 4D rendering with instance-level semantics for more robust understanding and tracking.
  • It improves semantic consistency by supervising Gaussian-level learnable features using temporally aligned instance masks and sentence embeddings from multimodal large language models, with two MLP decoders to support identity consistency over time.
  • To reduce temporal drift and improve stability, the method integrates 2D optical flow with 4D Gaussians and fine-tunes their motion, using the resulting alignment to provide more reliable initialization.
  • Training further incorporates geometry-aware SDF constraints and regularization terms that enforce surface continuity, targeting better temporal coherence in dynamic foreground modeling.
  • Experiments report that Director produces temporally coherent 4D reconstructions while enabling instance segmentation and open-vocabulary (language-conditioned) querying of the scene.
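The second key point can be made concrete with a minimal sketch. The paper's actual architecture is not public here, so the feature dimension, decoder widths, and loss weighting below are all assumptions; the sketch only illustrates the stated idea of two MLP decoders supervising per-Gaussian learnable features with instance identities and MLLM sentence embeddings.

```python
# Hypothetical sketch (NOT the paper's code): per-Gaussian learnable features
# decoded by two MLPs, supervised by (a) instance ids from temporally aligned
# masks and (b) sentence embeddings from a multimodal LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticDecoders(nn.Module):
    def __init__(self, feat_dim=16, num_instances=8, lang_dim=512):
        super().__init__()
        # Decoder 1: instance-identity logits, enforcing identity over time.
        self.id_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_instances))
        # Decoder 2: language embedding for open-vocabulary querying.
        self.lang_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, lang_dim))

    def forward(self, feats):
        return self.id_head(feats), F.normalize(self.lang_head(feats), dim=-1)

def semantic_loss(id_logits, lang_emb, instance_ids, sent_emb):
    # Cross-entropy against instance ids derived from the aligned masks.
    ce = F.cross_entropy(id_logits, instance_ids)
    # Cosine alignment with the sentence embedding of each Gaussian's instance.
    target = F.normalize(sent_emb[instance_ids], dim=-1)
    cos = 1.0 - (lang_emb * target).sum(-1).mean()
    return ce + cos  # equal weighting is an arbitrary choice here

torch.manual_seed(0)
N, K = 1024, 8
feats = torch.randn(N, 16, requires_grad=True)  # learnable per-Gaussian features
decoders = SemanticDecoders()
instance_ids = torch.randint(0, K, (N,))        # stand-in for mask-derived ids
sent_emb = torch.randn(K, 512)                  # stand-in for MLLM embeddings
id_logits, lang_emb = decoders(feats)
loss = semantic_loss(id_logits, lang_emb, instance_ids, sent_emb)
loss.backward()                                 # gradients flow into the features
```

In this reading, the gradient reaching `feats` is what makes the 4D representation itself language-aligned, rather than only the decoder weights.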

Abstract

Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and fine-tune their motions, yielding reliable initialization and reducing drift. For training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
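The open-vocabulary querying mentioned at the end of the abstract can be sketched as a similarity search: once each Gaussian carries a decoded language embedding, a text query selects the Gaussians whose embeddings lie close to the query's embedding. The threshold value and the function name below are illustrative assumptions, not the paper's interface.

```python
# Hypothetical open-vocabulary query over per-Gaussian language embeddings.
import torch
import torch.nn.functional as F

def query_gaussians(gauss_lang_emb, text_emb, thresh=0.5):
    # gauss_lang_emb: (N, D) decoded per-Gaussian language features.
    # text_emb: (D,) embedding of the language query (e.g., from an MLLM).
    sim = F.normalize(gauss_lang_emb, dim=-1) @ F.normalize(text_emb, dim=0)
    return sim > thresh  # boolean mask over the N Gaussians

torch.manual_seed(0)
emb = torch.randn(100, 32)       # stand-in per-Gaussian language embeddings
query = emb[3].clone()           # a query matching Gaussian 3 exactly
mask = query_gaussians(emb, query)
assert mask[3]                   # the matching Gaussian is selected
```

Because the selection happens per Gaussian, the same mechanism can drive both instance segmentation (render only the selected Gaussians) and language-conditioned scene editing.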