Lighting-grounded Video Generation with Renderer-based Agent Reasoning

arXiv cs.CV / 2026/4/10


Key Points

  • LiVER addresses the controllability problem of diffusion-based video generation by explicitly conditioning synthesis on 3D scene factors such as layout, lighting, and camera trajectory.
  • It disentangles these scene factors via control signals rendered from a 3D representation, aiming for the fine-grained editability that real production workflows require.
  • It introduces a new large-scale dataset with dense annotations of object placement, lighting, and camera parameters, providing the foundation for training.
  • A lightweight conditioning module and progressive training stabilize integration into a foundational video diffusion model, with reported gains in photorealism and temporal consistency.
  • In addition, a "scene agent" automatically translates high-level user instructions into the required 3D control signals, making 3D editing more accessible in both image-to-video and video-to-video settings.
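The paper itself does not publish code, but the conditioning idea in the points above can be sketched: rendered 3D control maps are projected into the latent space of a frozen video diffusion backbone by a small residual module. Everything here (class name, channel counts, the zero-init trick) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ControlConditioner(nn.Module):
    """Hypothetical lightweight conditioning module: projects rendered
    control maps (layout / lighting / camera buffers) into the latent
    space of a video diffusion backbone and adds them residually."""
    def __init__(self, control_channels: int, latent_channels: int):
        super().__init__()
        # Zero-initialized projection so the module starts as an identity
        # on the backbone -- a common trick for stable integration.
        self.proj = nn.Conv3d(control_channels, latent_channels,
                              kernel_size=3, padding=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, latents: torch.Tensor,
                control: torch.Tensor) -> torch.Tensor:
        # latents: (B, C_lat, T, H, W), control: (B, C_ctl, T, H, W)
        return latents + self.proj(control)

# Toy usage: 9 hypothetical control channels (e.g. depth, normals, shading)
cond = ControlConditioner(control_channels=9, latent_channels=4)
latents = torch.randn(1, 4, 8, 16, 16)
control = torch.randn(1, 9, 8, 16, 16)
out = cond(latents, control)
print(out.shape)  # torch.Size([1, 4, 8, 16, 16])
```

Because of the zero initialization, the module initially passes latents through unchanged, which is one plausible reading of how a "progressive training" schedule could attach new signals without destabilizing the base model.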

Abstract

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation that conditions synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
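The scene agent described in the abstract maps free-form instructions onto an editable 3D control state. A minimal rule-based stand-in illustrates the interface (the real system would presumably use an LLM or learned planner); the `SceneControls` fields and parsing rules below are invented for illustration.

```python
import re
from dataclasses import dataclass, field

@dataclass
class SceneControls:
    """Hypothetical editable 3D control state the agent would update
    and re-render into conditioning signals for the diffusion model."""
    light_intensity: float = 1.0
    camera_orbit_deg: float = 0.0
    objects: list = field(default_factory=list)

def apply_instruction(controls: SceneControls, text: str) -> SceneControls:
    """Toy stand-in for the scene agent: translate a high-level user
    instruction into edits of the explicit 3D scene controls."""
    t = text.lower()
    if "dim" in t or "darker" in t:
        controls.light_intensity *= 0.5          # halve the key light
    m = re.search(r"orbit\s+(-?\d+)", t)
    if m:
        controls.camera_orbit_deg += float(m.group(1))  # rotate camera
    m = re.search(r"add (?:a |an )?(\w+)", t)
    if m:
        controls.objects.append(m.group(1))      # insert a named object
    return controls

c = SceneControls()
c = apply_instruction(c, "Make the lighting darker and orbit 30 degrees")
c = apply_instruction(c, "Add a lamp")
print(c)  # light_intensity=0.5, camera_orbit_deg=30.0, objects=['lamp']
```

The key design point the abstract implies is the separation of concerns: the agent only edits an explicit 3D state, and the renderer-plus-diffusion pipeline stays unchanged, which is what keeps the individual scene factors disentangled.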