ClimateVID -- Social Media Videos Analysis and Challenges Involved

arXiv cs.CV / 5/1/2026


Key Points

  • The paper studies automated visual theme detection for short social-media videos, examining both zero-shot classification and unsupervised clustering to reveal patterns in public discourse.
  • It benchmarks several VLMs (VideoChatGPT, PandaGPT, VideoLLaVA) against a frame-wise CLIP baseline to assess how well these systems can identify visual themes without task-specific training.
  • Because current VLMs cannot reliably detect climate-change-specific classes, the authors shift focus to clustering using image-embedding models to analyze which visual frames group together.
  • The clustering approach is formulated as a minimum-cost multicut problem, and the study reports that ConvNeXt V2 and DINOv2 generate meaningful clusters with different clustering characteristics.
  • The work includes extensive evaluations and practical guidance, and it provides open-source code via a linked GitHub repository.
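The frame-wise CLIP baseline mentioned above scores each video frame against a set of text prompts and then aggregates the per-frame scores into a single video-level label. The paper does not spell out its aggregation rule, so the sketch below assumes a common choice: average the per-frame class probabilities and take the argmax. The class names and logit values are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def framewise_zero_shot_label(frame_logits, class_names):
    """Aggregate per-frame CLIP image-text logits (frames x classes)
    into one video-level label by averaging per-frame probabilities."""
    probs = softmax(np.asarray(frame_logits, dtype=float), axis=1)
    video_probs = probs.mean(axis=0)          # mean over frames
    return class_names[int(video_probs.argmax())], video_probs

# Toy example: 3 frames scored against 2 hypothetical prompts.
logits = [[2.0, 0.5],   # frame 1 favors "protest"
          [1.5, 0.7],   # frame 2 favors "protest"
          [0.2, 1.0]]   # frame 3 favors "wildfire"
label, probs = framewise_zero_shot_label(logits, ["protest", "wildfire"])
```

Mean-pooling probabilities is only one aggregation strategy; max-pooling or majority voting over per-frame argmaxes are equally plausible variants of such a baseline.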

Abstract

The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we advance automated visual theme detection by assessing zero-shot classification and clustering capabilities on social media data. (1) We evaluated the capabilities of notable VLMs such as VideoChatGPT, PandaGPT, and VideoLLaVA using zero-shot image classification and compared their performance to the baseline provided by frame-wise CLIP image classification. (2) By treating clustering as a minimum-cost multicut problem, we aim to uncover insightful patterns in an unsupervised manner. For both analysis strategies, we provide extensive evaluations and practical guidance for practitioners. While VLMs are currently not able to detect climate-change-specific classes, the clustering reveals distinct visual frames. We find that both ConvNeXt V2 and DINOv2 produce meaningful clusters, with DINOv2 focusing more on style differences and abstract categories, while ConvNeXt V2 clusters differ in more fine-grained ways. Code available at https://github.com/KathPra/ClimateVID.git.
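In the minimum-cost multicut formulation, nodes are frames (or their embeddings from a model like DINOv2 or ConvNeXt V2), signed edge costs encode attraction (positive) or repulsion (negative), and the objective is to find a graph decomposition minimizing the total cost of cut edges. The exact solver and the similarity-to-cost mapping below are assumptions, not the authors' method: costs are derived from pairwise similarities via a logit transform, and a simple heuristic merges endpoints of attractive edges and cuts the rest.

```python
import math

class UnionFind:
    """Disjoint-set structure with path compression for merging nodes."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def similarity_to_cost(sim, bias=0.5):
    # Map a similarity in (0, 1) to a signed edge cost via the logit
    # transform: sim > bias gives an attractive (positive) cost,
    # sim < bias a repulsive (negative) one. The bias is an assumption.
    return math.log(sim / (1 - sim)) - math.log(bias / (1 - bias))

def multicut_heuristic(n, edges):
    """Simple heuristic for minimum-cost multicut: merge the endpoints
    of every positive-cost (attractive) edge, cut the negative ones.
    `edges` is a list of (u, v, cost); returns one cluster id per node."""
    uf = UnionFind(n)
    for u, v, cost in edges:
        if cost > 0:
            uf.union(u, v)
    roots = {}
    return [roots.setdefault(uf.find(i), len(roots)) for i in range(n)]

# Toy graph: nodes 0-1-2 attract each other, 3-4 attract, cross edges repel.
edges = [(0, 1, 1.2), (1, 2, 0.8), (2, 3, -1.5), (3, 4, 0.9), (0, 4, -0.7)]
labels = multicut_heuristic(5, edges)
```

Real multicut solvers (e.g. greedy additive edge contraction or integer programming) handle the case where attractive edges bridge repulsive ones; the heuristic above simply ignores that conflict, which is acceptable only as an illustration of the problem structure.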