From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

arXiv cs.CV · April 16, 2026


Key Points

  • The paper tackles exo-to-ego video generation, where a first-person video is synthesized from a synchronized third-person view plus camera poses, but notes that synchronization creates spatio-temporal and geometric discontinuities that break assumptions of standard benchmarks.
  • It identifies the “synchronization-induced jump” as the core problem and proposes Syn2Seq-Forcing, which reframes the task as sequential signal modeling by interpolating between source and target videos to produce one continuous signal.
  • Using this sequential formulation, diffusion-based sequence models such as Diffusion Forcing Transformers (DFoT) can learn more coherent frame-to-frame transitions.
  • Experiments indicate that interpolating only the videos (without interpolating poses) still yields substantial improvements, suggesting pose interpolation is not the dominant factor.
  • The approach is presented as a unifying framework that can support both Exo2Ego and Ego2Exo within a single continuous sequence model, enabling a more general foundation for future cross-view synthesis research.
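The interpolation idea above can be illustrated with a toy sketch: bridge the synchronized third-person and first-person videos with cross-faded frames so the model sees one continuous signal instead of an abrupt view switch. The linear cross-fade and the function name below are illustrative assumptions; the paper's exact interpolation scheme is not specified here.

```python
import numpy as np

def build_exo2ego_sequence(exo, ego, n_blend=8):
    """Form one continuous signal from a synchronized exo/ego pair.

    exo, ego: frame arrays of shape (T, H, W, C).
    n_blend: number of interpolated frames bridging the two views.

    The bridge linearly cross-fades the last exo frame into the first
    ego frame, replacing the synchronization-induced jump with a smooth
    transition (a simple stand-in for the paper's interpolation).
    """
    # Blend weights strictly between 0 and 1 (endpoints already exist).
    alphas = np.linspace(0.0, 1.0, n_blend + 2)[1:-1]
    bridge = np.stack(
        [(1 - a) * exo[-1] + a * ego[0] for a in alphas]
    ).astype(exo.dtype)
    # One continuous sequence: exo frames -> bridge -> ego frames.
    return np.concatenate([exo, bridge, ego], axis=0)

# Toy usage: two 4-frame "videos" of 2x2 RGB frames.
exo = np.zeros((4, 2, 2, 3), dtype=np.float32)
ego = np.ones((4, 2, 2, 3), dtype=np.float32)
seq = build_exo2ego_sequence(exo, ego, n_blend=8)
print(seq.shape)  # (16, 2, 2, 3)
```

With constant black and white "videos", the bridge is a monotone ramp from 0 to 1, which is exactly the smooth-motion property standard video models assume.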

Abstract

Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g., Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation, already produces significant improvements, emphasizing that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.
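To connect the sequential formulation to the sequence models it enables: Diffusion Forcing trains a denoiser with an independent noise level per frame rather than one shared timestep, which is what lets a single model handle the continuous exo-to-ego signal. The sketch below shows only the per-frame noising step; the cosine schedule and timestep range are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_frame_noising(seq, num_levels=1000):
    """Corrupt each frame of a continuous sequence with its own
    noise level, in the spirit of Diffusion Forcing.

    seq: frame array of shape (T, H, W, C).
    Returns the noisy sequence, the per-frame timesteps, and the
    Gaussian noise (the target a denoiser would be trained on).
    """
    T = seq.shape[0]
    t = rng.integers(0, num_levels, size=T)  # independent timestep per frame
    # Toy cosine schedule: alpha_bar in (0, 1], decreasing in t.
    alpha_bar = np.cos(0.5 * np.pi * t / num_levels) ** 2
    a = alpha_bar.reshape(-1, *([1] * (seq.ndim - 1)))  # broadcast over HWC
    noise = rng.standard_normal(seq.shape)
    noisy = np.sqrt(a) * seq + np.sqrt(1 - a) * noise
    return noisy, t, noise

# Toy usage on a 16-frame continuous exo-to-ego sequence.
seq = rng.standard_normal((16, 2, 2, 3))
noisy, t, noise = per_frame_noising(seq)
print(noisy.shape, t.shape)  # (16, 2, 2, 3) (16,)
```

Because each frame carries its own timestep, clean conditioning frames (e.g., the exo portion) and heavily noised target frames (the ego portion) can coexist in one training sequence, which is the property the sequential reframing exploits.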