Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

arXiv cs.CV / 5/1/2026

Key Points

  • The paper proposes a dynamic cluster-based sampling method (DynamiCS) to reduce the training compute cost of vision-language models by controlling how training data is sampled.
  • Unlike earlier approaches that focus on balancing semantic topic distributions, DynamiCS explicitly addresses the risk that efficient downsampling can undercut representation of rare (long-tail) concepts.
  • DynamiCS downsamples large semantic clusters and upsamples small ones, redrawing the sample at every epoch; this per-epoch resampling is what makes the method dynamic (see the sketch after this list).
  • The authors report that DynamiCS preserves the relative ordering of semantic clusters while emphasizing long-tail concepts, leading to better performance on long-tail instances.
  • Experiments indicate that DynamiCS both lowers overall VLM training cost and improves accuracy on long-tail concepts compared with approaches that mainly flatten the semantic distribution.
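
The excerpt does not give the authors' exact sampling rule, so the sketch below is only one plausible reading of "downsample large clusters, upsample small ones": a power-law rescaling of cluster sizes, redrawn at every epoch. The function name dynamic_cluster_sample, the exponent alpha, the budget parameter, and the training loop are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def dynamic_cluster_sample(cluster_ids, budget, alpha=0.5, rng=None):
    """Draw one epoch's subset by rescaling semantic-cluster sizes.

    A cluster of size n_k gets a share of the budget proportional to
    n_k**alpha with 0 < alpha < 1, so large clusters are downsampled and
    small ones upsampled, while the relative ordering of cluster sizes
    is preserved (n_i > n_j implies n_i**alpha > n_j**alpha).
    """
    rng = rng or np.random.default_rng()
    cluster_ids = np.asarray(cluster_ids)
    clusters = [np.flatnonzero(cluster_ids == k) for k in np.unique(cluster_ids)]
    weights = np.array([len(c) for c in clusters], dtype=float) ** alpha
    targets = np.maximum(1, np.round(budget * weights / weights.sum())).astype(int)

    chosen = []
    for members, target in zip(clusters, targets):
        # Upsampling a small cluster requires drawing with replacement;
        # downsampling a large one does not.
        chosen.append(rng.choice(members, size=target, replace=target > len(members)))
    return np.concatenate(chosen)

# Redrawing the subset at the start of every epoch is what makes the
# procedure dynamic rather than a one-off pruning of the corpus:
# for epoch in range(num_epochs):
#     epoch_indices = dynamic_cluster_sample(cluster_ids, budget=len(cluster_ids) // 2)
#     train_one_epoch(model, dataset, epoch_indices)  # hypothetical training loop
```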

Abstract

The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, long-tail concepts remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a dynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it resamples the data at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data while emphasizing the long tail. This contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage on long-tail concepts.
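
To make the contrast in the abstract concrete, here is a small numeric illustration with invented cluster sizes: under the same assumed power-law rescaling as in the sketch above, a large cluster shrinks and a small one grows, but the size ranking is preserved, whereas flattening assigns every cluster the same share.

```python
import numpy as np

sizes = np.array([10_000, 1_000, 100, 10])  # hypothetical cluster sizes
budget = 4_000

# Power-law rescaling (assumed exponent 0.5): shares stay ordered like
# the original sizes.
w = sizes ** 0.5
scaled = np.round(budget * w / w.sum()).astype(int)  # -> [2763, 874, 276, 87]

# Flattening: every cluster gets the same share, erasing the ordering.
flat = np.full(len(sizes), budget // len(sizes))     # -> [1000, 1000, 1000, 1000]

# Under scaling, the largest cluster shrinks 10,000 -> ~2,763 and the
# smallest grows 10 -> ~87, yet the size ranking is unchanged.
```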