TrajLoom: Dense Future Trajectory Generation from Video

arXiv cs.CV, March 25, 2026


Key Points

  • TrajLoom is an arXiv framework for predicting future dense point trajectories (including visibility) from observed video context and past trajectories, targeting motion forecasting and controllable video generation.
  • The method combines three key modules: Grid-Anchor Offset Encoding, which reduces spatial bias; TrajLoom-VAE, which learns a compact spatiotemporal latent space via masked reconstruction and consistency regularization; and TrajLoom-Flow, which generates future trajectories in that latent space using flow matching, stabilized by boundary cues and K-step on-policy fine-tuning.
  • The paper introduces TrajLoomBench, a unified benchmark covering both real and synthetic videos under a standardized evaluation setup aligned with video-generation benchmarks.
  • Compared with prior state-of-the-art approaches, TrajLoom extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across multiple datasets, and its outputs can be used directly for downstream video generation and editing.
  • Code, model checkpoints, and datasets are released via the project website, enabling replication and further research.
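The Grid-Anchor Offset Encoding from the first module can be sketched roughly as follows. The paper's exact formulation may differ; the function names and the clipping behavior here are illustrative assumptions, but the core idea matches the description above: each point is stored as a small residual relative to the center of the pixel it falls in, so the representation looks the same everywhere on the grid.

```python
import numpy as np

def encode_offsets(points, H, W):
    """Represent each 2D point as an offset from the center of the pixel
    (grid cell) it falls in, reducing location-dependent bias.

    points: (N, 2) array of (x, y) positions in pixel coordinates.
    Returns integer anchor cell indices and residual offsets; for points
    inside the image, each offset component lies in [-0.5, 0.5).
    """
    anchors = np.floor(points).astype(np.int64)       # grid cell index
    anchors = np.clip(anchors, 0, [W - 1, H - 1])     # keep inside image bounds
    centers = anchors + 0.5                           # pixel-center anchor
    offsets = points - centers                        # residual offset
    return anchors, offsets

def decode_offsets(anchors, offsets):
    """Invert the encoding: absolute position = anchor center + offset."""
    return anchors + 0.5 + offsets
```

For example, a point at (10.3, 5.9) is anchored to cell (10, 5) with offset (-0.2, 0.4), and decoding recovers the original coordinates exactly.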

Abstract

Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.
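TrajLoom-Flow's generation step follows the standard flow-matching recipe: learn a velocity field over latents and integrate it from Gaussian noise to data. The paper's network, conditioning, and sampler are not spelled out in this summary, so the sketch below stubs the learned model with a generic `velocity_fn` and uses plain Euler integration; TrajLoom would additionally condition on video context and boundary cues, and applies K-step on-policy fine-tuning for stability.

```python
import numpy as np

def sample_flow_matching(velocity_fn, shape, n_steps=50, rng=None):
    """Generic flow-matching sampler: integrate dz/dt = v(z, t) with Euler
    steps from t=0 (Gaussian noise) to t=1 (a data latent).

    velocity_fn(z, t) stands in for the learned velocity network.
    """
    rng = np.random.default_rng(rng)
    z = rng.standard_normal(shape)        # start from noise at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)    # one Euler integration step
    return z
```

As a toy sanity check, the optimal-transport field toward a single fixed latent `z1` is `v(z, t) = (z1 - z) / (1 - t)`; integrating it drives any noise sample to `z1` by `t = 1`.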