EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow

arXiv cs.CV / 3/31/2026



Key Points

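EFlow proposes a framework for few-step training and inference that simultaneously eases the two bottlenecks of video diffusion transformers: the computational cost of attention and the number of iterative sampling steps.
To cut the number of sampling steps, it uses a solution-flow objective that maps a noised state at time t to the state at time s, and it presents techniques that make this objective computationally feasible and high-quality at video scale.
For efficiency and stability, it introduces Gated Local-Global Attention, a token-droppable hybrid attention block that stays robust under random token dropping.
The training recipe combines Path-Drop Guided training, which replaces the guidance target with a cheap weak path, with a Mean-Velocity Additivity regularizer that preserves fidelity at extremely low step counts.
The authors report up to 2.5x higher training throughput than standard solution-flow and 45.3x lower inference latency than standard iterative models, while aiming for competitive performance on Kinetics and large-scale text-to-video datasets.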
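To make the solution-flow idea concrete, the following is a minimal toy sketch, not the EFlow model. It assumes a scalar ODE dx/dt = A * x whose solution map is known in closed form, so `solution_map` stands in for a learned network that jumps from time t to time s, and `velocity` stands in for a standard iterative model. All names and the constant `A` are hypothetical.

```python
import math

A = 2.0  # hypothetical rate of the toy ODE dx/dt = A * x


def velocity(x, t):
    """Toy velocity field; a stand-in for a learned velocity network."""
    return A * x


def solution_map(x_t, t, s):
    """Closed-form ODE solution from time t to time s.

    A stand-in for a learned solution-flow network f(x_t, t, s).
    """
    return x_t * math.exp(A * (s - t))


def euler_sample(x, t_start, t_end, num_steps):
    """Iterative sampling: many small Euler steps along the velocity field."""
    h = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        x = x + h * velocity(x, t)
        t += h
    return x


def few_step_sample(x, times):
    """Few-step sampling: one solution-map call per consecutive time pair."""
    for t, s in zip(times[:-1], times[1:]):
        x = solution_map(x, t, s)
    return x


x1 = 1.5  # state at t = 1 (the "noise" end in this toy setup)
exact = solution_map(x1, 1.0, 0.0)
two_step = few_step_sample(x1, [1.0, 0.5, 0.0])
euler_4 = euler_sample(x1, 1.0, 0.0, 4)
euler_200 = euler_sample(x1, 1.0, 0.0, 200)

print("exact     :", exact)
print("two-step  :", two_step)
print("euler-4   :", euler_4)
print("euler-200 :", euler_200)
```

In this toy setting two solution-map calls reproduce the exact endpoint, while the Euler sampler needs many steps to get close. A learned map is only approximate, so real few-step quality depends on how well the network fits the true flow.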

Abstract

Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the expensive quadratic complexity of attention per step, and the iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework that tackles these bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to the state at time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block that is efficient, expressive, and remains highly stable under aggressive random token-dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe. We propose Path-Drop Guided training to replace the expensive guidance target with a computationally cheap, weak path. Furthermore, we augment this with a Mean-Velocity Additivity regularizer to ensure high fidelity at extremely low step counts. Together, these components enable a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput than standard solution-flow and 45.3x lower inference latency than standard iterative models, with competitive performance on Kinetics and large-scale text-to-video datasets.
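The abstract does not give the formula for the Mean-Velocity Additivity regularizer. One plausible reading, based only on the name, is the following: if the mean velocity between times t and s is u(x_t, t, s) = (x_s - x_t) / (s - t), then for any intermediate time r the displacements should add up, (s - t) u(x_t, t, s) = (r - t) u(x_t, t, r) + (s - r) u(x_r, r, s). The sketch below checks this identity on a toy ODE; `exact_map`, `biased_map`, and the constant `A` are hypothetical stand-ins, and the residual shown is an assumed form, not the authors' loss.

```python
import math

A = 2.0  # hypothetical rate of the toy ODE dx/dt = A * x


def exact_map(x_t, t, s):
    """Exact flow of the toy ODE from time t to time s."""
    return x_t * math.exp(A * (s - t))


def biased_map(x_t, t, s):
    """Imperfect flow (a single Euler step); a stand-in for an undertrained model."""
    return x_t * (1.0 + A * (s - t))


def mean_velocity(flow, x_t, t, s):
    """Average velocity implied by a flow map between times t and s."""
    return (flow(x_t, t, s) - x_t) / (s - t)


def additivity_residual(flow, x_t, t, r, s):
    """Direct displacement t -> s minus the composed displacement t -> r -> s."""
    x_r = flow(x_t, t, r)
    direct = (s - t) * mean_velocity(flow, x_t, t, s)
    composed = ((r - t) * mean_velocity(flow, x_t, t, r)
                + (s - r) * mean_velocity(flow, x_r, r, s))
    return direct - composed


res_exact = additivity_residual(exact_map, 1.5, 1.0, 0.5, 0.0)
res_biased = additivity_residual(biased_map, 1.5, 1.0, 0.5, 0.0)

print("residual, exact flow :", res_exact)
print("residual, biased flow:", res_biased)
```

The exact flow gives a residual at floating-point precision, while the one-step Euler map does not. Under this reading, penalizing the squared residual would push a model's large jumps to agree with compositions of its smaller ones.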