DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

arXiv cs.LG / 3/30/2026


Key Points

  • The paper introduces DataFlex, a unified framework for data-centric dynamic training of large language models that standardizes data selection, mixture adjustment, and sample reweighting within one extensible system.
  • DataFlex is built on LLaMA-Factory and designed as a drop-in replacement for the standard LLM training workflow, providing reusable trainer abstractions and modular components.
  • It unifies model-dependent operations such as embedding extraction, inference, and gradient computation, and supports large-scale training setups including DeepSpeed ZeRO-3.
  • Experiments show dynamic data selection can outperform static full-data training on MMLU for Mistral-7B and Llama-3.2-3B, while data mixture methods like DoReMi and ODM improve both MMLU and corpus-level perplexity for Qwen2.5-1.5B.
  • The authors report that DataFlex delivers consistent runtime improvements over the methods' original implementations, and position it as infrastructure for reproducibility and fair comparison across data-centric methods.
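To make the "dynamic data selection" paradigm concrete, here is a minimal sketch of the general idea: periodically re-score the candidate pool with the current model and train only on the currently hardest samples. This is an illustrative toy (synthetic losses, a simple top-k rule), not DataFlex's actual API; the function names and the selection heuristic are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_pool(pool_losses: np.ndarray) -> np.ndarray:
    # In a real system, scores would come from a forward pass of the
    # current model; here we perturb synthetic per-sample losses.
    return pool_losses + rng.normal(0.0, 0.05, size=pool_losses.shape)

def select_topk(scores: np.ndarray, k: int) -> np.ndarray:
    # A simple selection rule: keep the k highest-loss (hardest) samples.
    return np.argsort(scores)[-k:]

pool = rng.uniform(0.5, 3.0, size=100)   # synthetic loss per candidate sample
for interval in range(3):                # periodic re-selection during training
    scores = score_pool(pool)
    active = select_topk(scores, k=32)
    # ... train on pool[active] for the next interval ...
    pool[active] *= 0.9                  # pretend training lowers their loss
```

The point of the loop structure is that selection is interleaved with training, so the active subset shifts as the model improves; a static subset chosen once before training cannot do this.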

Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
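The domain mixture adjustment mentioned above (DoReMi, ODM) can be illustrated with a simplified multiplicative-weights update: domains where the proxy model's loss exceeds a reference model's loss get upweighted. This is a hedged sketch of the DoReMi-style idea only; the actual update rule, hyperparameters, and interfaces in the paper may differ.

```python
import numpy as np

def update_domain_weights(weights: np.ndarray,
                          excess_loss: np.ndarray,
                          lr: float = 0.1) -> np.ndarray:
    """Exponentiated-gradient step on domain sampling proportions.

    Domains with larger excess loss (proxy loss minus reference loss)
    receive more weight. Simplified DoReMi-style rule, not the exact
    update from the paper.
    """
    logits = np.log(weights) + lr * excess_loss
    new = np.exp(logits - logits.max())   # subtract max for numerical stability
    return new / new.sum()                # renormalize to a distribution

weights = np.ones(3) / 3                  # e.g. web, code, academic domains
excess = np.array([0.8, 0.1, 0.3])        # per-domain proxy-vs-reference loss gap
weights = update_domain_weights(weights, excess)
```

After the update, the first domain (largest excess loss) holds the largest share, and the weights still sum to one, so they can be used directly as sampling proportions for the next training interval.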