Scaling Generalist Data-Analytic Agents
arXiv cs.CL / 3/16/2026
Key Points
- DataMind is proposed as a scalable data synthesis and agent-training recipe to build generalist data-analytic agents, addressing the limitations of open-source models on diverse data formats and long-horizon reasoning.
- The approach includes a fine-grained task taxonomy with recursive easy-to-hard composition, a knowledge-augmented trajectory sampling strategy with model- and rule-based filtering, a memory-efficient multi-turn rollout framework, and a training objective that mixes supervised fine-tuning and reinforcement learning.
- Trained on the DataMind-12K dataset, DataMind-14B achieves state-of-the-art performance across multiple data analysis benchmarks, outperforming proprietary baselines such as DeepSeek-V3.1 and GPT-5, while DataMind-7B is the top-performing open-source model.
- The authors plan to release DataMind-12K, DataMind-7B, and DataMind-14B to the community to support future research and evaluation.
- They also offer empirical insights from exploratory trials to guide agentic training for researchers and practitioners.
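
To make the mixed training objective in the Key Points more concrete, here is a minimal sketch of one way to blend a supervised fine-tuning loss with a reinforcement learning loss. The summary does not specify how DataMind combines the two, so the cosine schedule, the function names `sft_weight` and `mixed_loss`, and the `start`/`end` parameters are all illustrative assumptions rather than the authors' method.

```python
import math


def sft_weight(step, total_steps, start=1.0, end=0.0):
    """Hypothetical cosine schedule for the SFT weight (an assumption,
    not taken from the paper): begins at `start`, decays to `end`."""
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def mixed_loss(sft_loss, rl_loss, step, total_steps):
    """Convex combination of the two losses: early steps lean on
    imitation (SFT), later steps lean on the RL signal."""
    w = sft_weight(step, total_steps)
    return w * sft_loss + (1.0 - w) * rl_loss


if __name__ == "__main__":
    # Placeholder loss values, purely for illustration.
    for step in (0, 50, 100):
        print(step, mixed_loss(sft_loss=2.0, rl_loss=4.0, step=step, total_steps=100))
```

In a real training loop, `sft_loss` and `rl_loss` would be per-batch scalars from the model; the point of the sketch is only that a single scheduled weight can shift emphasis from one loss to the other over the course of training.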




