Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

arXiv cs.CV / 5/1/2026

Key Points

  • The paper proposes a dynamic cluster-based sampling method (DynamiCS) to reduce the training compute cost of vision-language models by controlling how training data is sampled.
  • Unlike earlier approaches that focus on balancing semantic topic distributions, DynamiCS explicitly addresses the risk that efficient downsampling can undercut representation of rare (long-tail) concepts.
  • DynamiCS downsamples large semantic clusters and upsamples small ones, redrawing the sample at every epoch; this per-epoch resampling is what makes the method dynamic (see the sketch after this list).
  • The authors report that DynamiCS preserves the relative ordering of semantic clusters while emphasizing long-tail concepts, leading to better performance on long-tail instances.
  • Experiments indicate that DynamiCS both lowers overall VLM training cost and improves accuracy on long-tail concepts compared with approaches that mainly flatten the semantic distribution.
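
The excerpt does not give the authors' exact sampling rule, so the sketch below is only one plausible reading of "downsample large clusters, upsample small ones": a power-law rescaling of cluster sizes, redrawn at every epoch. The function name dynamic_cluster_sample, the exponent alpha, the budget parameter, and the training loop are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def dynamic_cluster_sample(cluster_ids, budget, alpha=0.5, rng=None):
    """Draw one epoch's subset by rescaling semantic-cluster sizes.

    A cluster of size n_k gets a share of the budget proportional to
    n_k**alpha with 0 < alpha < 1, so large clusters are downsampled and
    small ones upsampled, while the relative ordering of cluster sizes
    is preserved (n_i > n_j implies n_i**alpha > n_j**alpha).
    """
    rng = rng or np.random.default_rng()
    cluster_ids = np.asarray(cluster_ids)
    clusters = [np.flatnonzero(cluster_ids == k) for k in np.unique(cluster_ids)]
    weights = np.array([len(c) for c in clusters], dtype=float) ** alpha
    targets = np.maximum(1, np.round(budget * weights / weights.sum())).astype(int)

    chosen = []
    for members, target in zip(clusters, targets):
        # Upsampling a small cluster requires drawing with replacement;
        # downsampling a large one does not.
        chosen.append(rng.choice(members, size=target, replace=target > len(members)))
    return np.concatenate(chosen)

# Redrawing the subset at the start of every epoch is what makes the
# procedure dynamic rather than a one-off pruning of the corpus:
# for epoch in range(num_epochs):
#     epoch_indices = dynamic_cluster_sample(cluster_ids, budget=len(cluster_ids) // 2)
#     train_one_epoch(model, dataset, epoch_indices)  # hypothetical training loop
```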

Abstract

The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, long-tail concepts remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a dynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it resamples the data at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data while emphasizing the long tail. This contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage on long-tail concepts.
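
To make the contrast in the abstract concrete, here is a small numeric illustration with invented cluster sizes: under the same assumed power-law rescaling as in the sketch above, a large cluster shrinks and a small one grows, but the size ranking is preserved, whereas flattening assigns every cluster the same share.

```python
import numpy as np

sizes = np.array([10_000, 1_000, 100, 10])  # hypothetical cluster sizes
budget = 4_000

# Power-law rescaling (assumed exponent 0.5): shares stay ordered like
# the original sizes.
w = sizes ** 0.5
scaled = np.round(budget * w / w.sum()).astype(int)  # -> [2763, 874, 276, 87]

# Flattening: every cluster gets the same share, erasing the ordering.
flat = np.full(len(sizes), budget // len(sizes))     # -> [1000, 1000, 1000, 1000]

# Under scaling, the largest cluster shrinks 10,000 -> ~2,763 and the
# smallest grows 10 -> ~87, yet the size ranking is unchanged.
```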