StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention

arXiv cs.CV / 3/23/2026


Key Points

  • StreetForward introduces a pose-free, tracker-free feedforward framework for dynamic street reconstruction in autonomous driving, enabling rapid scene reconstruction without per-scene optimization.
  • It builds on the alternating attention mechanism of the Visual Geometry Grounded Transformer (VGGT), adding a temporal mask attention module that extracts motion information from image sequences and produces motion-aware latent representations.
  • Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting and optimized jointly through cross-frame rendering with spatio-temporal consistency, which lets the model infer per-pixel velocities and render high-fidelity novel views at new poses and times.
  • Trained on the Waymo Open Dataset, StreetForward demonstrates superior performance on novel view synthesis and depth estimation compared with existing methods and shows zero-shot generalization on CARLA and other datasets.
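
To make the temporal mask attention idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not specify the mask layout, so the frame-level causal rule (a token may attend only to tokens from its own frame or earlier frames) is an assumption suggested by the "causal attention" in the title. The function names `temporal_causal_mask` and `masked_attention` are hypothetical.

```python
import math


def temporal_causal_mask(num_frames, tokens_per_frame):
    """Build an N x N boolean mask, N = num_frames * tokens_per_frame.

    mask[q][k] is True when query token q may attend to key token k.
    Assumed rule (not from the paper): the key's frame index must not
    exceed the query's frame index.
    """
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention; masked-out keys get zero weight."""
    dim = len(queries[0])
    outputs = []
    for q_idx, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if mask[q_idx][k_idx]
            else -math.inf
            for k_idx, k in enumerate(keys)
        ]
        # Every token can see itself, so the maximum is always finite.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [
                sum(w * v[d] for w, v in zip(weights, values)) / total
                for d in range(len(values[0]))
            ]
        )
    return outputs


# Toy input: 3 frames, 2 tokens per frame, identical queries and keys,
# so each token averages the values of all tokens it is allowed to see.
mask = temporal_causal_mask(num_frames=3, tokens_per_frame=2)
tokens = [[1.0, 0.0] for _ in range(6)]
values = [[float(i)] for i in range(6)]
out = masked_attention(tokens, tokens, values, mask)
```

In this toy setup, tokens in the first frame average only first-frame values, while tokens in the last frame average over all frames. A real model would apply such a mask inside batched multi-head attention on learned features.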

Abstract

Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction. Building upon the alternating attention mechanism from Visual Geometry Grounded Transformer (VGGT), we propose a simple yet effective temporal mask attention module that captures dynamic motion information from image sequences and produces motion-aware latent representations. Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting, and are optimized jointly by cross-frame rendering with spatio-temporal consistency, allowing the model to infer per-pixel velocities and produce high-fidelity novel views at new poses and times. We train and evaluate our model on the Waymo Open Dataset, demonstrating superior performance on novel view synthesis and depth estimation compared to existing methods. Furthermore, zero-shot inference on CARLA and other datasets validates the generalization capability of our approach. More visualizations are available on our project page: https://streetforward.github.io.
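
One way to picture how per-pixel velocities support rendering at a new time is to move each Gaussian center along its predicted velocity before rendering. The sketch below assumes a constant-velocity (linear) motion model over a short time gap, which the abstract does not state. The function name `advect_centers` and the data layout are hypothetical. Static content is treated the same way as dynamic content, just with zero velocity, in keeping with the uniform representation the abstract describes.

```python
def advect_centers(centers, velocities, t_src, t_tgt):
    """Move 3D Gaussian centers from time t_src to time t_tgt.

    centers    : list of (x, y, z) positions at t_src
    velocities : list of (vx, vy, vz), one per center; zero for static content
    Assumption (not from the paper): constant velocity over the interval.
    """
    dt = t_tgt - t_src
    return [
        tuple(c + v * dt for c, v in zip(center, velocity))
        for center, velocity in zip(centers, velocities)
    ]


# Toy input: one static point and one point moving along +z at 2 units/s.
centers = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
velocities = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
moved = advect_centers(centers, velocities, t_src=0.0, t_tgt=0.5)
```

The moved centers would then be passed to a Gaussian Splatting renderer at the target camera pose. That rendering step is outside the scope of this sketch.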