Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute

arXiv cs.CV / 5/6/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper tackles subject-driven video generation by proposing a zero-shot approach that avoids per-subject tuning and does not require large-scale subject–video training pairs.
  • It decomposes the task into learning subject identity injection from subject-image pairs and preserving motion characteristics using only a small set of arbitrary videos.
  • The method uses stochastic optimization with random reference-frame sampling and image-token dropout to reduce trivial first-frame copying and improve generalization (see the sketch after this list).
  • Experiments with CogVideoX-5B show that adapting a single model with 200K subject-image pairs and 4,000 arbitrary videos can be done in 288 A100 GPU hours—about 1% of the compute of prior zero-shot baselines—while remaining competitive on subject fidelity and motion quality.
  • The authors report that the same recipe also transfers to Wan 2.2-5B, suggesting broader applicability across video generation model families.
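
The stochastic switching mentioned above (random reference-frame sampling plus image-token dropout) can be pictured with a short PyTorch-style sketch. Everything below is illustrative: the method names (`encode_image`, `denoising_loss`), the dropout rate, and the task-sampling probability are assumptions for exposition, not the authors' implementation.

```python
import random
import torch

# Assumed hyperparameters; the paper's exact values may differ.
IMAGE_TOKEN_DROPOUT = 0.3   # fraction of reference-image tokens to drop
IDENTITY_TASK_PROB = 0.5    # probability of sampling the identity-injection task

def training_step(model, subject_image_loader, video_loader, optimizer):
    """One stochastic-switching step: identity injection OR motion preservation."""
    optimizer.zero_grad()
    if random.random() < IDENTITY_TASK_PROB:
        # (i) Identity injection: learn the subject from subject-image pairs.
        ref_image, target = next(subject_image_loader)
        ref_tokens = model.encode_image(ref_image)          # [B, T, D]
        # Randomly drop reference-image tokens so the model cannot simply
        # copy the reference into the first generated frame.
        keep = torch.rand(ref_tokens.shape[1]) > IMAGE_TOKEN_DROPOUT
        loss = model.denoising_loss(target, condition=ref_tokens[:, keep])
    else:
        # (ii) Motion-awareness preservation: train on arbitrary videos,
        # conditioning on a randomly sampled frame rather than always frame 0.
        video = next(video_loader)                           # [B, F, C, H, W]
        ref_idx = random.randrange(video.shape[1])
        ref_tokens = model.encode_image(video[:, ref_idx])
        loss = model.denoising_loss(video, condition=ref_tokens)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the two data sources (subject-image pairs and a small pool of arbitrary videos) never need to be paired with each other, which is what removes the dependence on subject–video pairs.
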

Abstract

Subject-driven video generation (SDV-Gen) aims to produce videos of a specific subject by adapting a pretrained video model, enabling personalized and application-driven content creation. To achieve this goal, per-subject tuning methods require approximately 200 A100 GPU hours to generate a customized video, whereas zero-shot methods avoid per-subject tuning but typically rely on millions of subject-video pairs for supervision, incurring massive network fine-tuning costs (10K-200K A100 GPU hours). We propose a data- and compute-efficient zero-shot SDV-Gen framework that avoids test-time per-subject tuning and the use of large-scale subject-video pairs. Our key idea decomposes SDV-Gen into (i) identity injection learned from subject-image pairs and (ii) motion-awareness preservation maintained with a small set of arbitrary videos. We optimize the two tasks with stochastic switching, using random reference-frame sampling and image-token dropout to prevent trivial first-frame copying. Our gradient analysis shows that the two objectives rapidly evolve toward nearly orthogonal update subspaces, explaining the stable optimization. Using CogVideoX-5B, we adapt a single model with 200K subject-image pairs and 4,000 arbitrary videos in 288 A100 GPU hours. This amounts to about 1% of the compute of prior zero-shot baselines (i.e., 0.4% of VACE and 2.8% of Phantom) while using no subject-video pairs, yet the adapted model remains competitive in subject fidelity and motion quality. We show that the same recipe transfers to Wan 2.2-5B.
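
The abstract's claim that the two objectives "rapidly evolve toward nearly orthogonal update subspaces" can be probed by comparing the parameter-space gradients of the two losses. The helper below is a minimal sketch of that idea, assuming a standard PyTorch model and two loss tensors computed from the two tasks; it is not the authors' analysis code.

```python
import torch

def gradient_cosine_similarity(model, loss_identity, loss_motion):
    """Cosine similarity between the gradients of the two training objectives.

    A value near zero indicates the two updates are nearly orthogonal in
    parameter space, which is the kind of behavior the paper's gradient
    analysis reports. Illustrative only.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    grads_a = torch.autograd.grad(loss_identity, params,
                                  retain_graph=True, allow_unused=True)
    grads_b = torch.autograd.grad(loss_motion, params, allow_unused=True)

    # Keep only parameters that receive a gradient from both objectives.
    pairs = [(a, b) for a, b in zip(grads_a, grads_b)
             if a is not None and b is not None]
    flat_a = torch.cat([a.flatten() for a, _ in pairs])
    flat_b = torch.cat([b.flatten() for _, b in pairs])

    return torch.nn.functional.cosine_similarity(flat_a, flat_b, dim=0).item()
```

Tracking this quantity over training steps would show whether the two objectives interfere with each other or, as reported, settle into largely independent update directions.
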