Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

arXiv cs.CV / 4/10/2026

Key Points

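Uni-ViGU starts from the imbalance that, for video, generation is computationally more expensive than understanding, and proposes a framework that unifies the two on top of a video generator rather than by extending an understanding-centric multimodal LLM.
A unified flow method handles video with continuous flow matching and text with discrete flow matching within a single process, enabling coherent multimodal generation of video and text.
A modality-driven MoE (Mixture of Experts) design adds lightweight layers to the Transformer blocks so the model can also generate text, while aiming to preserve the generator's generative priors.
To transfer generation knowledge to understanding, the authors design a two-stage bidirectional training scheme, Knowledge Recall (prompt reconstruction) followed by Capability Refinement (fine-tuning on detailed captions), so that shared representations are also learned on the understanding side.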
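
As a rough illustration of the modality-driven MoE idea in the third point, the sketch below routes each token to an expert chosen by its modality tag rather than by a learned gate. Everything here is an assumption made for illustration: the function names, the toy arithmetic standing in for real layers, and the hard routing rule are hypothetical and are not taken from the paper or its code.

```python
# Hypothetical sketch of modality-driven expert routing (not the paper's code).

def video_expert(x):
    # Stand-in for the pretrained generator's layers, kept to preserve priors.
    return [2.0 * v for v in x]

def text_expert(x):
    # Stand-in for the added lightweight text-generation layers.
    return [v + 1.0 for v in x]

# Assumption: one expert per modality, selected by tag instead of a learned gate.
EXPERTS = {"video": video_expert, "text": text_expert}

def modality_moe(tokens):
    """Route each (modality, vector) pair to the expert for its modality."""
    return [(modality, EXPERTS[modality](x)) for modality, x in tokens]

if __name__ == "__main__":
    mixed = [("video", [1.0, 2.0]), ("text", [1.0])]
    print(modality_moe(mixed))
```

The point of the sketch is only that video tokens keep passing through the original parameters while text tokens get their own small path; how the real model shares attention or sizes the added layers is not stated in this summary.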

Abstract

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
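
The unified flow method described in the abstract can be pictured with a minimal sketch of how one training sample might be corrupted for both modalities. This is not the authors' implementation. It assumes a linear interpolation path for the continuous (video) branch, a masking-based variant for the discrete (text) branch, and a single shared time value `t` for both; the names `MASK`, `corrupt_video`, `corrupt_text`, and `unified_sample` are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token for the discrete (text) branch

def corrupt_video(x1, t, rng):
    """Continuous flow matching on a linear path (assumed).

    x_t = (1 - t) * x0 + t * x1 with x0 drawn from a standard Gaussian.
    The regression target is the velocity x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity

def corrupt_text(tokens, t, rng):
    """Discrete flow matching, masking variant (assumed).

    Each token is kept with probability t and replaced by MASK otherwise,
    so t = 0 is fully masked and t = 1 is the clean text.
    """
    return [tok if rng.random() < t else MASK for tok in tokens]

def unified_sample(video_latent, tokens, t, seed=0):
    """Build one training sample; sharing t across modalities is an assumption."""
    rng = random.Random(seed)
    xt, velocity = corrupt_video(video_latent, t, rng)
    yt = corrupt_text(tokens, t, rng)
    return {
        "t": t,
        "video_xt": xt,
        "video_target": velocity,
        "text_yt": yt,
        "text_target": list(tokens),
    }

if __name__ == "__main__":
    print(unified_sample([0.5, -1.0], ["a", "cat"], t=0.3))
```

A model trained on such samples would regress the velocity for the video entries and classify the original token at masked text positions. Whether Uni-ViGU uses exactly these paths or a shared time variable is not specified in the abstract.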
