All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

arXiv cs.CV / April 15, 2026


Key Points

  • The paper proposes a unified synthetic data pipeline that automatically generates large volumes of multimodal video data with diverse, rich supervision for multiple video-understanding tasks.
  • It supports multiple task formats in one framework, aiming to make data creation scalable and consistent across tasks like object counting, video question answering, and video object segmentation (a hypothetical sample schema is sketched after this list).
  • To improve reasoning and visual grounding, the authors introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about video content rather than depending only on captions or generic instructions.
  • Experiments across three benchmark tasks show that models trained largely on synthetic data can generalize well to real-world datasets and often outperform models trained with more traditional real-data annotation approaches.
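
To make the unified multi-task format concrete, here is a minimal sketch of what a single synthetic sample spanning all three task types could look like. The paper does not publish its data schema, so every field name and value below is an illustrative assumption, not the authors' actual format.

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticVideoSample:
    """Hypothetical unified record covering the three task formats."""
    frames: list[str]                       # paths to rendered video frames
    task: str                               # "counting" | "vqa" | "segmentation"
    question: str | None = None             # structured question (counting / VQA)
    answer: str | None = None               # ground-truth answer string
    masks: list[str] | None = None          # per-frame mask paths (segmentation)
    metadata: dict = field(default_factory=dict)  # scene parameters used to render

# One rendered scene can yield supervision for several tasks at once:
samples = [
    SyntheticVideoSample(frames=["f0.png", "f1.png"], task="counting",
                         question="How many red cubes appear in the video?",
                         answer="3"),
    SyntheticVideoSample(frames=["f0.png", "f1.png"], task="segmentation",
                         masks=["m0.png", "m1.png"]),
]
```

Because the ground truth comes from the renderer's own scene configuration, answers and masks are exact by construction, which is what lets a pipeline like this emit consistent labels for multiple tasks without human annotation.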

Abstract

Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach on three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.
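
As a rough illustration of the VQA-based fine-tuning formulation, the sketch below wraps one synthetic clip as a structured question-answer training record in a common chat-style instruction format. The `<video>` placeholder, the record layout, and the `to_vqa_record` helper are assumptions for illustration; the paper's exact prompt template is not given here.

```python
def to_vqa_record(frames: list[str], question: str, answer: str) -> dict:
    """Wrap one synthetic clip as a structured question-answer training record.

    This mirrors widely used chat-style instruction-tuning formats; it is an
    assumed schema, not the paper's published one.
    """
    return {
        "video": frames,  # frame paths for the clip
        "messages": [
            {"role": "user", "content": f"<video>\n{question}"},
            {"role": "assistant", "content": answer},
        ],
    }

record = to_vqa_record(
    frames=["f0.png", "f1.png"],
    question="How many red cubes appear in the video?",
    answer="3",
)
```

The point of this formulation is that every training target is a concrete answer grounded in the video, so the model is pushed to locate and reason about visual content instead of paraphrasing a caption.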