Video Analysis and Generation via a Semantic Progress Function

arXiv cs.CV · April 27, 2026


Key Points

  • The paper addresses a common issue in image/video generation where semantic meaning stays nearly constant for a while and then changes abruptly, creating non-linear evolution over time.
  • It introduces a one-dimensional Semantic Progress Function that models how meaning shifts across a sequence by measuring semantic embedding distances per frame and fitting a smooth cumulative curve.
  • Deviations from a straight-line semantic progress indicate uneven “semantic pacing,” which the authors use to diagnose and analyze temporal irregularities in generated videos.
  • Using this metric, the paper proposes a semantic linearization method that retimes (reparameterizes) a sequence so semantic change occurs at a constant rate, improving smoothness and coherence.
  • The framework is presented as model-agnostic, enabling comparisons of pacing across different generators and steering both real and generated video toward user-defined target pacing.
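The core measurement described above can be sketched in a few lines: per-frame distances between consecutive semantic embeddings are accumulated into a normalized progress curve. This is a minimal sketch, not the paper's implementation; the embedding model (e.g. CLIP), the distance metric, and the smoothing step are assumptions here.

```python
import numpy as np

def semantic_progress(embeddings):
    """Cumulative semantic-shift curve for a sequence of frames.

    embeddings: (T, D) array of per-frame semantic embeddings
    (which encoder to use is an assumption; the paper only says
    "semantic embeddings"). Returns a curve p with p[0] = 0 and
    p[-1] = 1; a perfectly even pacing would trace a straight line.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Euclidean distance between consecutive frame embeddings.
    steps = np.linalg.norm(np.diff(emb, axis=0), axis=1)
    # Accumulate and normalize so the curve runs from 0 to 1.
    cum = np.concatenate([[0.0], np.cumsum(steps)])
    return cum / cum[-1]
```

Deviation of this curve from the diagonal is what the paper reads as uneven semantic pacing; in practice one would also fit a smooth curve to `cum` before comparing.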

Abstract

Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.
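The linearization step described in the abstract amounts to inverting the monotone progress curve: sample time points so that equal steps in semantic progress map to (possibly non-uniform) steps in frame time. The sketch below uses simple linear interpolation as the inverse; the paper's actual fitting and retiming procedure may differ, and the function name is ours.

```python
import numpy as np

def linearize_retiming(progress, num_samples=None):
    """Retime a sequence so semantic change unfolds at a constant rate.

    progress: monotone cumulative semantic-progress curve in [0, 1],
    one value per frame (e.g. the output of a semantic progress
    function). Returns fractional frame indices at which to resample
    the sequence; dense semantic regions get more samples, static
    stretches fewer.
    """
    p = np.asarray(progress, dtype=float)
    T = len(p)
    if num_samples is None:
        num_samples = T
    # Uniform targets in semantic-progress space.
    targets = np.linspace(0.0, 1.0, num_samples)
    # Invert p(t) by linear interpolation; requires p non-decreasing.
    return np.interp(targets, p, np.arange(T))
```

The returned fractional indices would then drive frame interpolation or generator re-sampling; steering toward an arbitrary (non-linear) target pacing corresponds to replacing the uniform `targets` with the desired pacing curve.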