ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
arXiv cs.CV / 3/13/2026
Key Points
- ShotVerse proposes a Plan-then-Control framework that splits text-driven multi-shot video generation into two stages: a VLM-based Planner that produces camera trajectories from text, and a Controller that renders multi-shot cinematic content along those trajectories.
- The approach is grounded in a data-centric paradigm that treats aligned (Caption, Trajectory, Video) triplets as a joint distribution to connect automated planning with precise execution.
- It includes an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system and introduces the ShotVerse-Bench dataset with a three-track evaluation protocol.
- Experiments demonstrate that ShotVerse delivers camera-accurate, cross-shot consistent multi-shot videos with improved cinematic aesthetics, bridging the gap between unreliable text-only camera control and labor-intensive manual trajectory plotting.
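
The calibration idea in the key points, placing per-shot camera trajectories in one global coordinate system, can be illustrated with a minimal sketch. This is not the ShotVerse pipeline: it assumes each shot already has camera-to-world poses as 4x4 homogeneous matrices in its own local frame, and that a rigid shot-to-global transform is known for each shot. How such transforms might be estimated is not covered here. The names `matmul4`, `translation`, and `to_global` are hypothetical helpers, and the poses are made-up translation-only values.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x: float, y: float, z: float) -> Matrix:
    """Build a pure-translation pose (identity rotation)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def to_global(shot_to_global: Matrix, local_poses: List[Matrix]) -> List[Matrix]:
    """Re-express one shot's camera-to-world poses in the global frame."""
    return [matmul4(shot_to_global, pose) for pose in local_poses]


# Two shots, each calibrated independently around its own local origin.
shot_a = [translation(0.0, 0.0, 0.0), translation(1.0, 0.0, 0.0)]
shot_b = [translation(0.0, 0.0, 0.0), translation(0.0, 0.0, 2.0)]

# Assumption: shot A defines the global frame, and shot B's local origin
# sits at (5, 0, 0) in that frame.
global_trajectory = (
    to_global(translation(0.0, 0.0, 0.0), shot_a)
    + to_global(translation(5.0, 0.0, 0.0), shot_b)
)

# Camera centers are the translation column of each camera-to-world pose.
camera_positions = [(p[0][3], p[1][3], p[2][3]) for p in global_trajectory]
```

After this step both shots share one frame, so cross-shot quantities such as the distance between the last camera of shot A and the first camera of shot B become meaningful.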




