LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing

arXiv cs.CV / 4/21/2026


Key Points

  • The paper introduces LIVE, a joint training framework that combines large-scale, high-quality image editing data with video datasets to improve instruction-based video editing, sidestepping the high cost of annotating video data.
  • To address the static-image vs dynamic-video mismatch, LIVE applies a frame-wise token noise strategy and leverages pretrained video generative models to produce plausible temporal changes.
  • It cleans public datasets, builds an automated data pipeline, and adopts a two-stage training strategy to gradually “anneal” video-editing capabilities.
  • The authors build a new evaluation benchmark with 60+ difficult tasks common in image editing but underrepresented in existing video datasets, reporting state-of-the-art results via comparisons and ablations.
  • The source code is planned to be publicly released, enabling further research and replication.

Abstract

Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Moreover, through cleaning public datasets and constructing an automated data pipeline, we adopt a two-stage training strategy to anneal video editing capabilities. Furthermore, we curate a comprehensive evaluation benchmark encompassing over 60 challenging tasks that are prevalent in image editing but scarce in existing video datasets. Extensive comparative and ablation experiments demonstrate that our method achieves state-of-the-art performance. The source code will be publicly available.
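The abstract describes the frame-wise token noise strategy only at a high level: latents of specific frames are treated as reasoning tokens, and a pretrained video model fills in plausible temporal change. The paper's actual implementation is not given here; the NumPy sketch below is one plausible reading of that idea, with the function name, noise schedule, and array layout all assumptions. Clean latents for designated "kept" frames (e.g. the edited image from an image-editing pair) are preserved, while every other frame is pushed toward noise, so a generative model conditioned on the result must infer the dynamics.

```python
import numpy as np

def frame_wise_token_noise(latents, keep_frames, noise_level=1.0, seed=0):
    """Hypothetical sketch of frame-wise token noise (not the paper's code).

    latents:     array of shape [T, C, H, W], one latent per video frame
    keep_frames: set of frame indices left clean, acting as "reasoning" tokens
    noise_level: 0.0 = untouched, 1.0 = pure Gaussian noise for other frames
    """
    rng = np.random.default_rng(seed)
    noised = latents.copy()
    num_frames = latents.shape[0]
    for t in range(num_frames):
        if t in keep_frames:
            continue  # kept frame stays clean; the model conditions on it
        eps = rng.standard_normal(latents[t].shape).astype(latents.dtype)
        # Simple interpolation toward noise; a real diffusion schedule
        # would use its own signal/noise coefficients here.
        noised[t] = (1.0 - noise_level) * latents[t] + noise_level * eps
    return noised
```

Under this reading, an image-editing pair can supply the clean latent for a single frame while the remaining frames start as noise, letting video data and image data share one training objective.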