DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

arXiv cs.CV / 4/30/2026


Key Points

  • The paper introduces DenseStep2M, a training-free pipeline that automatically extracts high-quality, temporally grounded procedural step annotations from in-the-wild instructional videos.
  • It addresses key dataset noise issues such as inaccurate ASR transcripts and inconsistent narration–video temporal alignment by segmenting videos into shots, filtering misaligned content, and using multimodal and reasoning models (Qwen2.5-VL and DeepSeek-R1) to produce structured steps (see the sketch after this list).
  • DenseStep2M scales to about 100K videos and 2M detailed steps, and the authors also create the DenseCaption100 benchmark with human-written captions to evaluate alignment quality.
  • Experiments show strong agreement between generated steps and human annotations, and demonstrate improvements on downstream tasks including dense video captioning, procedural step grounding, and cross-modal retrieval, with good zero-shot generalization across different camera perspectives.
  • The dataset is released publicly on Hugging Face to support long-form, long-term video understanding and multimodal alignment research.
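
To make the pipeline concrete, below is a minimal Python sketch of the shot-level filtering and step-generation flow described above. It is an illustration under assumptions, not the authors' released code: the helper callables, the `Shot`/`Step` fields, and the alignment threshold are placeholders, with Qwen2.5-VL and DeepSeek-R1 standing behind the hypothetical `caption_shot` and `summarize_steps` hooks.

```python
"""Minimal sketch of a DenseStep2M-style annotation loop (illustrative only)."""
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Shot:
    start: float          # shot start time in seconds
    end: float            # shot end time in seconds
    transcript: str       # noisy ASR text overlapping this shot


@dataclass
class Step:
    start: float
    end: float
    description: str      # structured, temporally grounded step text


def annotate_video(
    shots: List[Shot],
    caption_shot: Callable[[Shot], str],                              # e.g. a Qwen2.5-VL wrapper (assumed)
    score_alignment: Callable[[str, str], float],                     # caption-vs-ASR agreement (assumed)
    summarize_steps: Callable[[List[Tuple[float, float, str]]], List[Step]],  # e.g. a DeepSeek-R1 wrapper (assumed)
    min_alignment: float = 0.5,                                       # assumed filtering threshold
) -> List[Step]:
    """Turn segmented shots plus noisy narration into grounded procedural steps."""
    kept: List[Tuple[float, float, str]] = []
    for shot in shots:
        visual_caption = caption_shot(shot)
        # Drop shots whose narration does not match what is shown on screen.
        if score_alignment(visual_caption, shot.transcript) < min_alignment:
            continue
        kept.append((shot.start, shot.end, visual_caption))
    # Let a reasoning LLM merge the surviving shots into structured instructional steps.
    return summarize_steps(kept)


if __name__ == "__main__":
    # Toy stand-ins for the model calls, just to show the data flow.
    demo_shots = [
        Shot(0.0, 8.5, "first we dice the onions"),
        Shot(8.5, 20.0, "thanks for watching, hit subscribe"),
    ]
    steps = annotate_video(
        demo_shots,
        caption_shot=lambda s: s.transcript,
        score_alignment=lambda cap, asr: 1.0 if "dice" in asr else 0.0,
        summarize_steps=lambda kept: [Step(s, e, t) for s, e, t in kept],
    )
    print(steps)
```

The hooks reflect the paper's training-free framing: each stage is an off-the-shelf model call plus a filtering rule, so the loop scales by processing more videos rather than by fine-tuning anything.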

Abstract

Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.
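
Since the dataset is public on the Hugging Face Hub, it should be loadable with the `datasets` library. The split names and per-example schema are not specified in the announcement, so treat the snippet below as a sketch and verify the actual layout against the dataset card.

```python
from datasets import load_dataset

# Pull the released annotations from the Hugging Face Hub.
# Split names and column schema are assumptions; check `print(ds)` and
# the dataset card for the actual layout.
ds = load_dataset("mingjige/DenseStep2M")
print(ds)  # shows available splits and column names

first_split = next(iter(ds.values()))
print(first_split[0])  # one annotated video: id plus its grounded step list (assumed fields)
```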