CRAFT: Clustered Regression for Adaptive Filtering of Training data
arXiv cs.AI / 4/27/2026
Key Points
- The paper introduces CRAFT, a vectorization-agnostic method for selecting a small, high-quality subset of training data to make fine-tuning on very large corpora more efficient.
- CRAFT works in two stages. First, it matches the validation source distribution by allocating the selection budget proportionally across k-means clusters. Second, within each cluster, it selects the training pairs whose target embeddings minimize a conditional expected distance defined by the validation target distribution.
- The authors provide theoretical guarantees that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the remaining error controlled by the cluster diameters.
- Experiments on English–Hindi translation (selecting from 33M NLLB pairs and fine-tuning mBART with LoRA) show CRAFT reaching 43.34 BLEU, outperforming TSDS by 2.13 BLEU on the same setup while running over 40× faster; with TF-IDF features the full selection pipeline can finish in under a minute on CPU.
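
The two-stage selection described in the key points can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it only mirrors the summary under several assumptions: k-means is fit on the training source embeddings, validation sources are assigned to the nearest centroid, quotas use largest-remainder rounding, and the "conditional expected distance" is approximated by the mean Euclidean distance from a training target embedding to the validation target embeddings in the same cluster. The function names (`kmeans`, `assign`, `proportional_quotas`, `select_subset`) are hypothetical.

```python
import numpy as np


def assign(x, centers):
    """Return the index of the nearest center for every row of x."""
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # fancy indexing copies
    for _ in range(iters):
        labels = assign(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return centers


def proportional_quotas(val_labels, k, budget):
    """Split the budget across clusters in proportion to validation counts."""
    counts = np.bincount(val_labels, minlength=k)
    exact = budget * counts / counts.sum()
    quotas = np.floor(exact).astype(int)
    leftover = budget - quotas.sum()
    order = np.argsort(-(exact - quotas))  # largest fractional remainders first
    quotas[order[:leftover]] += 1
    return quotas


def select_subset(train_src, train_tgt, val_src, val_tgt, k, budget, seed=0):
    """Return sorted indices of the selected training pairs."""
    train_src = np.asarray(train_src, dtype=float)
    train_tgt = np.asarray(train_tgt, dtype=float)
    val_src = np.asarray(val_src, dtype=float)
    val_tgt = np.asarray(val_tgt, dtype=float)

    # Stage 1: cluster the sources and allocate a per-cluster budget.
    centers = kmeans(train_src, k, seed=seed)
    train_labels = assign(train_src, centers)
    val_labels = assign(val_src, centers)
    quotas = proportional_quotas(val_labels, k, budget)

    # Stage 2: within each cluster, keep the pairs whose target embedding
    # is closest on average to the cluster's validation targets.
    chosen = []
    for j in range(k):
        if quotas[j] == 0:
            continue  # no validation mass here, so nothing is selected
        idx = np.flatnonzero(train_labels == j)
        val_t = val_tgt[val_labels == j]  # non-empty because quota > 0
        dists = np.linalg.norm(
            train_tgt[idx][:, None, :] - val_t[None, :, :], axis=2
        )
        score = dists.mean(axis=1)
        chosen.extend(idx[np.argsort(score)[: quotas[j]]].tolist())
    return np.array(sorted(chosen), dtype=int)
```

The pairwise distance matrix in stage 2 is the simplest choice for a sketch and would not scale to tens of millions of pairs as written. If a cluster holds fewer training pairs than its quota, this version simply returns fewer than `budget` indices.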