From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

arXiv cs.CV · April 16, 2026


Key Points

  • The paper tackles exo-to-ego video generation, where a first-person video is synthesized from a synchronized third-person view plus camera poses, but notes that synchronization creates spatio-temporal and geometric discontinuities that break assumptions of standard benchmarks.
  • It identifies the “synchronization-induced jump” as the core problem and proposes Syn2Seq-Forcing, which reframes the task as sequential signal modeling by interpolating between source and target videos to produce one continuous signal.
  • Using this sequential formulation, diffusion-based sequence models such as Diffusion Forcing Transformers (DFoT) can learn more coherent frame-to-frame transitions.
  • Experiments indicate that interpolating only the videos (without interpolating poses) still yields substantial improvements, suggesting pose interpolation is not the dominant factor.
  • The approach is presented as a unifying framework that can support both Exo2Ego and Ego2Exo within a single continuous sequence model, enabling a more general foundation for future cross-view synthesis research.
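The interpolation idea above can be illustrated with a toy sketch: bridge the synchronized third-person and first-person videos with cross-faded frames so the model sees one continuous signal instead of an abrupt view switch. The linear cross-fade and the function name below are illustrative assumptions; the paper's exact interpolation scheme is not specified here.

```python
import numpy as np

def build_exo2ego_sequence(exo, ego, n_blend=8):
    """Form one continuous signal from a synchronized exo/ego pair.

    exo, ego: frame arrays of shape (T, H, W, C).
    n_blend: number of interpolated frames bridging the two views.

    The bridge linearly cross-fades the last exo frame into the first
    ego frame, replacing the synchronization-induced jump with a smooth
    transition (a simple stand-in for the paper's interpolation).
    """
    # Blend weights strictly between 0 and 1 (endpoints already exist).
    alphas = np.linspace(0.0, 1.0, n_blend + 2)[1:-1]
    bridge = np.stack(
        [(1 - a) * exo[-1] + a * ego[0] for a in alphas]
    ).astype(exo.dtype)
    # One continuous sequence: exo frames -> bridge -> ego frames.
    return np.concatenate([exo, bridge, ego], axis=0)

# Toy usage: two 4-frame "videos" of 2x2 RGB frames.
exo = np.zeros((4, 2, 2, 3), dtype=np.float32)
ego = np.ones((4, 2, 2, 3), dtype=np.float32)
seq = build_exo2ego_sequence(exo, ego, n_blend=8)
print(seq.shape)  # (16, 2, 2, 3)
```

With constant black and white "videos", the bridge is a monotone ramp from 0 to 1, which is exactly the smooth-motion property standard video models assume.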

Abstract

Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g., Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation, already produces significant improvements, emphasizing that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.
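To connect the sequential formulation to the sequence models it enables: Diffusion Forcing trains a denoiser with an independent noise level per frame rather than one shared timestep, which is what lets a single model handle the continuous exo-to-ego signal. The sketch below shows only the per-frame noising step; the cosine schedule and timestep range are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_frame_noising(seq, num_levels=1000):
    """Corrupt each frame of a continuous sequence with its own
    noise level, in the spirit of Diffusion Forcing.

    seq: frame array of shape (T, H, W, C).
    Returns the noisy sequence, the per-frame timesteps, and the
    Gaussian noise (the target a denoiser would be trained on).
    """
    T = seq.shape[0]
    t = rng.integers(0, num_levels, size=T)  # independent timestep per frame
    # Toy cosine schedule: alpha_bar in (0, 1], decreasing in t.
    alpha_bar = np.cos(0.5 * np.pi * t / num_levels) ** 2
    a = alpha_bar.reshape(-1, *([1] * (seq.ndim - 1)))  # broadcast over HWC
    noise = rng.standard_normal(seq.shape)
    noisy = np.sqrt(a) * seq + np.sqrt(1 - a) * noise
    return noisy, t, noise

# Toy usage on a 16-frame continuous exo-to-ego sequence.
seq = rng.standard_normal((16, 2, 2, 3))
noisy, t, noise = per_frame_noising(seq)
print(noisy.shape, t.shape)  # (16, 2, 2, 3) (16,)
```

Because each frame carries its own timestep, clean conditioning frames (e.g., the exo portion) and heavily noised target frames (the ego portion) can coexist in one training sequence, which is the property the sequential reframing exploits.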