Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

arXiv cs.CV · 15 Apr 2026


Key Points

  • The paper proposes a new orbital video generation method that produces geometrically realistic and long-range consistent view extrapolation from a single input image.
  • Instead of relying on pixel-wise attention for frame consistency, it conditions generation on rich 3D shape priors encoded by a 3D foundation generative model.
  • The approach uses two scales of latent features—global denoised structure guidance and view-dependent, fine-grained latent images projected from volumetric features—to better constrain rear-view synthesis.
  • A multi-scale 3D adapter injects these feature tokens into a base video model via cross-attention, enabling efficient inference and largely model-agnostic fine-tuning.
  • Experiments across multiple benchmarks reportedly show improved visual quality, shape realism, multi-view consistency, and robustness on complex camera trajectories and real-world (“in-the-wild”) images.
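The paper does not include code, but the second key point above — projecting volumetric features from the 3D foundation model into view-dependent latent images — can be sketched minimally. Everything below is an illustrative assumption (the feature-volume layout, the 90-degree yaw approximation of the orbital camera, and orthographic averaging in place of real ray resampling), not the authors' implementation.

```python
import numpy as np

def view_latent_image(vol, yaw_steps):
    """Project a volumetric feature grid to a view-dependent latent image.

    vol: (C, D, H, W) feature volume from the 3D foundation model
         (hypothetical layout: channels, depth, height, width).
    yaw_steps: number of 90-degree yaw rotations approximating the
         orbital camera pose for this frame.

    A real implementation would resample features along arbitrary camera
    rays; here we rotate the grid about the vertical axis and average
    features along the viewing (depth) axis as an orthographic stand-in.
    """
    # Rotate in the depth-width plane to face the requested yaw.
    rotated = np.rot90(vol, k=yaw_steps, axes=(1, 3))
    # Collapse the viewing axis to get a (C, H, W) latent image.
    return rotated.mean(axis=1)
```

The resulting (C, H, W) maps would then be flattened into the per-view conditioning tokens that the adapter consumes, alongside the single global denoised latent vector.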

Abstract

We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g., rear-view synthesis, where pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundation generative model as an auxiliary constraint, motivated by its ability to model realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as overall structural guidance, and (ii) a set of latent images projected from volumetric features that provide view-dependent, fine-grained geometric details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features model complete object shapes and improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter that injects feature tokens into the base video model via cross-attention, which preserves the capabilities gained from general video pretraining and enables a simple, model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism, and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.
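The adapter's injection mechanism described in the abstract — conditioning tokens attended to by the video model's tokens via cross-attention, added residually so the pretrained backbone is left intact — can be sketched as follows. This is a single-head NumPy toy under assumed shapes, not the paper's actual adapter; all names and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_tokens, cond_tokens, Wq, Wk, Wv):
    """Residually inject 3D feature tokens into video tokens.

    video_tokens: (N, d)   tokens from the base video model (queries).
    cond_tokens:  (M, d_c) 3D conditioning tokens, e.g. the global
                  denoised latent concatenated with flattened
                  view-dependent latent-image tokens (keys/values).
    Wq: (d, d), Wk/Wv: (d_c, d) learned projections (random here).
    """
    Q = video_tokens @ Wq
    K = cond_tokens @ Wk
    V = cond_tokens @ Wv
    # Scaled dot-product attention over the conditioning tokens.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    # Residual add: with zero-initialized Wv the base model's
    # behavior is unchanged, which is what makes fine-tuning
    # largely model-agnostic.
    return video_tokens + A @ V
```

Zero-initializing the value projection is a common adapter trick (used, e.g., in ControlNet-style conditioning): at the start of fine-tuning the injected branch contributes nothing, so general video pretraining is preserved.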