Motif-Video 2B: Technical Report

arXiv cs.CV / 4/21/2026

Key Points

The technical report introduces Motif-Video 2B, aiming to achieve strong text-to-video generation quality with a much smaller budget (under 10M clips and under 100,000 H200 GPU hours).
The core approach is architectural specialization: separating prompt alignment, temporal consistency, and fine-detail recovery instead of forcing them through a single shared pathway.
Motif-Video 2B uses Shared Cross-Attention to maintain strong text control over long video token sequences, alongside a three-part backbone for early fusion, joint representation learning, and later detail refinement.
An efficiency-focused training recipe combines dynamic token routing with early-phase feature alignment to a frozen pretrained video encoder. The report's analysis finds that later blocks develop clearer cross-frame attention structure than standard single-stream baselines.
On VBench, Motif-Video 2B attains 83.76%, outperforming Wan2.1 14B while using 7× fewer parameters and substantially less training data, indicating that architecture and training efficiency can reduce the quality gap to larger models.
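
The summary does not say how dynamic token routing is implemented. As a rough illustration of why routing can save training compute, the sketch below assumes a scheme in which a random subset of tokens passes through an expensive block while the remaining tokens bypass it unchanged and are merged back afterward. The names `route_tokens`, `routed_forward`, `toy_block`, and `keep_ratio` are hypothetical and do not come from the report.

```python
import random


def route_tokens(num_tokens, keep_ratio, rng):
    """Pick which token positions go through the block (assumed random routing)."""
    num_kept = max(1, round(num_tokens * keep_ratio))
    kept = sorted(rng.sample(range(num_tokens), num_kept))
    kept_set = set(kept)
    bypassed = [i for i in range(num_tokens) if i not in kept_set]
    return kept, bypassed


def routed_forward(tokens, block, keep_ratio, rng):
    """Apply `block` only to the routed subset; bypassed tokens pass through unchanged."""
    kept, _ = route_tokens(len(tokens), keep_ratio, rng)
    processed = block([tokens[i] for i in kept])
    out = list(tokens)  # start from the untouched sequence
    for position, value in zip(kept, processed):
        out[position] = value  # scatter processed tokens back into place
    return out, kept


def toy_block(xs):
    """Stand-in for a transformer block: adds 1.0 to every token it sees."""
    return [x + 1.0 for x in xs]


rng = random.Random(0)
tokens = [0.0] * 16
out, kept = routed_forward(tokens, toy_block, keep_ratio=0.5, rng=rng)
```

Because self-attention cost grows roughly quadratically with sequence length, a block that sees half of the tokens does about a quarter of the attention work, which is the kind of saving a routing scheme targets.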

Abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
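
The abstract names early-phase feature alignment to a frozen pretrained video encoder but gives no loss or schedule. One plausible form, sketched below purely as an assumption, adds a cosine-similarity term between the generator's intermediate token features and the frozen encoder's features, and switches that term off after a cutoff step. The functions `alignment_loss`, `alignment_weight`, and `total_loss`, and the parameter `cutoff_step`, are hypothetical names, not the report's.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def alignment_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over matching token positions.

    `teacher_tokens` stand for features from the frozen encoder; they are
    treated as constants and would receive no gradient in a real framework.
    """
    sims = [cosine_similarity(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return 1.0 - sum(sims) / len(sims)


def alignment_weight(step, cutoff_step, base_weight=1.0):
    """Assumed early-phase schedule: full weight before the cutoff, zero after."""
    return base_weight if step < cutoff_step else 0.0


def total_loss(generation_loss, student_tokens, teacher_tokens, step, cutoff_step):
    """Generation loss plus the alignment term, active only in the early phase."""
    weight = alignment_weight(step, cutoff_step)
    return generation_loss + weight * alignment_loss(student_tokens, teacher_tokens)


teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[2.0, 0.0], [0.0, 3.0]]     # same directions, different scale
orthogonal_student = [[0.0, 1.0], [1.0, 0.0]]  # perpendicular to the teacher

loss_aligned = alignment_loss(aligned_student, teacher)
loss_orthogonal = alignment_loss(orthogonal_student, teacher)
early = total_loss(0.5, orthogonal_student, teacher, step=10, cutoff_step=100)
late = total_loss(0.5, orthogonal_student, teacher, step=200, cutoff_step=100)
```

In this toy setup the aligned features give a loss near zero regardless of scale, the orthogonal features give a loss of one, and the alignment term no longer contributes once `step` passes `cutoff_step`.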