Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

arXiv cs.CV / 3/25/2026


Key Points

  • The paper proposes ClusterSTM, a Cluster-Wise Spatio-Temporal Masking method aimed at making large-scale video-language pretraining more computationally efficient.
  • ClusterSTM addresses two key issues in prior masked video modeling: excessive visual information loss at high masking ratios and temporal information leakage from inter-frame correlations.
  • It works by first performing intra-frame clustering to group visual tokens into semantically independent clusters, then applying cluster-wise masking that retains the token with the highest temporal density per cluster.
  • The approach is reinforced by a video-text relevance reconstruction objective designed to align high-level multimodal semantics beyond standard visual reconstruction.
  • Experiments across multiple benchmarks show improved performance on video-text retrieval, video question answering, and video captioning, reported as new state-of-the-art results among efficient video-language models.
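The masking step in the key points above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: it assumes a simple k-means for the intra-frame clustering and uses mean cosine similarity to the same token position in adjacent frames as a stand-in for "temporal density" (the paper's exact clustering method and density definition may differ).

```python
import numpy as np

def kmeans_labels(x, k, iters=10, seed=0):
    # Plain Lloyd's k-means; returns one cluster label per row of x.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = x[labels == c].mean(axis=0)
    return labels

def temporal_density(tokens):
    # tokens: (T, N, D). Assumed density proxy: mean cosine similarity of
    # token (t, n) to the tokens at the same index in the neighboring
    # frames (np.roll wraps at the clip boundary; fine for a sketch).
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    prev, nxt = np.roll(unit, 1, axis=0), np.roll(unit, -1, axis=0)
    return ((unit * prev).sum(-1) + (unit * nxt).sum(-1)) / 2.0  # (T, N)

def cluster_wise_mask(tokens, num_clusters):
    # Keep exactly one token per (frame, cluster): the one with the
    # highest temporal density, as described in the key points.
    T, N, _ = tokens.shape
    density = temporal_density(tokens)
    keep = np.zeros((T, N), dtype=bool)
    for t in range(T):
        labels = kmeans_labels(tokens[t], num_clusters, seed=t)
        for c in range(num_clusters):
            idx = np.flatnonzero(labels == c)
            if idx.size:
                keep[t, idx[density[t, idx].argmax()]] = True
    return keep
```

With T frames of N tokens and k clusters, at most k tokens per frame survive, so the effective masking ratio is roughly 1 - k/N while each retained token still represents a distinct semantic region.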

Abstract

Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.
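The abstract's video-text relevance reconstruction objective is not specified in detail here. One plausible reading is that the model predicts a relevance score from the visible (unmasked) tokens and regresses it onto the similarity of the full video and text embeddings. The sketch below illustrates that reading only; the function names, the cosine target, and the MSE form are all assumptions, not the paper's stated objective.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity between two (B, D) embedding matrices.
    denom = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return (a * b).sum(-1) / denom

def relevance_reconstruction_loss(pred_relevance, video_emb, text_emb):
    # Hypothetical objective: the score predicted from visible tokens
    # (pred_relevance, shape (B,)) should match the video-text cosine
    # similarity computed from full embeddings. MSE is an assumption.
    target = cosine(video_emb, text_emb)
    return float(np.mean((pred_relevance - target) ** 2))
```

Under this reading, the objective supervises a high-level multimodal quantity (how well the clip matches its caption) rather than low-level pixel or token reconstruction, which is consistent with the abstract's phrase "beyond conventional visual reconstruction".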