Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

arXiv cs.CV / 5/6/2026


Key Points

  • Traditional video object-centric learning enforces temporal consistency by training dynamics modules that predict future object slots, but the paper argues these predictors are costly approximations of a discrete correspondence problem.
  • The paper shows that modern self-supervised vision backbones already provide instance-discriminative features, making learned temporal prediction unnecessary for identity consistency.
  • It proposes Grounded Correspondence, which maintains frame-to-frame identity using deterministic bipartite matching (Hungarian matching) over slot representations instead of learned transition functions.
  • Slots are initialized from salient regions using frozen backbone features, and the method uses zero learnable parameters for temporal modeling while still achieving competitive results on MOVi-D, MOVi-E, and YouTube-VIS.
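The matching step described in the bullets above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it assumes slots are fixed-size feature vectors, uses cosine distance as the pairwise cost, and relies on SciPy's `linear_sum_assignment` for the Hungarian step.

```python
# Hypothetical sketch of frame-to-frame slot identity via Hungarian
# matching (not the paper's implementation). Slots are rows of
# (num_slots, dim) arrays; cosine distance is an assumed cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots(prev_slots: np.ndarray, curr_slots: np.ndarray) -> np.ndarray:
    """Return, for each previous slot, the index of its current-frame match."""
    prev_n = prev_slots / np.linalg.norm(prev_slots, axis=1, keepdims=True)
    curr_n = curr_slots / np.linalg.norm(curr_slots, axis=1, keepdims=True)
    cost = 1.0 - prev_n @ curr_n.T          # cosine distance cost matrix
    row, col = linear_sum_assignment(cost)  # deterministic bipartite matching
    order = np.empty_like(col)
    order[row] = col
    return order

# Toy check: slots that arrive permuted in the next frame are re-identified.
prev = np.eye(3)
curr = prev[[2, 0, 1]]          # frame-2 slots in a new order
print(match_slots(prev, curr))  # → [1 2 0]
```

Because the assignment is deterministic and parameter-free, this step adds no trainable weights, which is the point the paper makes about "zero learnable parameters for temporal modeling".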

Abstract

The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
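The abstract's other component, initializing slots from salient regions of frozen backbone features, can also be sketched. The specifics below are assumptions for illustration: saliency is approximated by per-patch feature norm, and each slot seed is simply the feature of one top-scoring patch; the paper's actual saliency criterion may differ.

```python
# Minimal sketch (assumed, not the paper's method) of seeding slots
# from salient regions of a frozen backbone's patch features.
import numpy as np

def init_slots(features: np.ndarray, num_slots: int) -> np.ndarray:
    """features: (num_patches, dim) frozen backbone features.
    Returns (num_slots, dim) slot seeds from the most salient patches."""
    saliency = np.linalg.norm(features, axis=1)   # per-patch saliency proxy
    top = np.argsort(saliency)[::-1][:num_slots]  # most salient patch indices
    return features[top]

feats = np.random.default_rng(0).normal(size=(196, 8))  # e.g. a 14x14 ViT grid
slots = init_slots(feats, num_slots=5)
print(slots.shape)  # → (5, 8)
```

Since the backbone is frozen and the selection is a plain top-k, this initialization, like the matching step, introduces no learned temporal machinery.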