CRAFT: Clustered Regression for Adaptive Filtering of Training data

arXiv cs.AI / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper introduces CRAFT, a vectorization-agnostic method for selecting a small, high-quality subset of training data to make fine-tuning on very large corpora more efficient.
  • CRAFT uses a two-stage process: it first matches the validation source distribution by allocating a proportional budget across k-means clusters, then selects within each cluster the training pairs whose target embeddings minimize a conditional expected distance based on the validation target distribution.
  • The authors provide theoretical guarantees showing that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with remaining error controlled by cluster diameters.
  • Experiments on English–Hindi translation (selecting from 33M NLLB pairs and fine-tuning mBART with LoRA) show CRAFT reaching 43.34 BLEU, outperforming TSDS by 2.13 BLEU on the same setup while running over 40× faster; with TF-IDF features the full selection pipeline can finish in under a minute on CPU.
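The stage-one allocation described above can be sketched as a small helper: split the total selection budget across k-means clusters in proportion to how often each cluster appears in the validation set. This is a minimal illustration, not the paper's implementation; the function name and rounding scheme are assumptions.

```python
import numpy as np

def allocate_budget(val_labels, n_clusters, budget):
    """Proportional per-cluster budget allocation (illustrative sketch of
    CRAFT's stage one). `val_labels` are the cluster assignments of the
    validation set; the returned array sums exactly to `budget`."""
    counts = np.bincount(val_labels, minlength=n_clusters)
    props = counts / counts.sum()
    alloc = np.floor(props * budget).astype(int)
    # hand leftover slots to the clusters with the largest remainders
    remainder = props * budget - alloc
    for idx in np.argsort(-remainder)[: budget - alloc.sum()]:
        alloc[idx] += 1
    return alloc
```

For example, with a validation set that falls 60/40 into two clusters and a budget of 7, the allocation is 4 and 3 pairs respectively.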

Abstract

Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8× speedup.
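The full two-stage pipeline can be sketched end to end: cluster the source embeddings, allocate the budget by validation cluster proportions, then pick within each cluster the training pairs whose target embeddings lie closest to a conditional reference derived from validation targets. Everything here is an assumption-laden sketch, not the paper's code: the function name, the use of scikit-learn's `KMeans`, simple rounding for the allocation, and the per-cluster mean validation target embedding as the conditional reference.

```python
import numpy as np
from sklearn.cluster import KMeans

def craft_select(train_src, train_tgt, val_src, val_tgt, budget, k=8, seed=0):
    """CRAFT-style two-stage selection sketch.
    Stage 1: k-means on source embeddings; per-cluster budgets follow the
    validation cluster proportions. Stage 2: within each cluster, keep the
    training pairs whose target embeddings are nearest to the cluster's mean
    validation target embedding. Returns indices into the training pool."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(train_src)
    train_lab = km.labels_
    val_lab = km.predict(val_src)

    counts = np.bincount(val_lab, minlength=k)
    alloc = np.round(counts / counts.sum() * budget).astype(int)

    chosen = []
    for c in range(k):
        members = np.where(train_lab == c)[0]
        if alloc[c] == 0 or members.size == 0:
            continue
        # conditional target reference: mean validation target embedding of
        # cluster c (global mean as a fallback for empty validation clusters)
        in_c = val_lab == c
        ref = val_tgt[in_c].mean(axis=0) if in_c.any() else val_tgt.mean(axis=0)
        dist = np.linalg.norm(train_tgt[members] - ref, axis=1)
        chosen.extend(members[np.argsort(dist)[: alloc[c]]])
    return np.array(chosen)
```

Because `np.round` is used for the allocation, the selected count can drift from the budget by a few pairs; a faithful implementation would reconcile the rounding as in the paper's stage-one scheme.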