Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
arXiv cs.AI · April 29, 2026
Key Points
- The paper addresses a core AI challenge: fine-tuning LLMs on domain corpora improves performance, but practitioners lack a feedback loop for diagnosing which data issues cause failures on domain tasks.
- It proposes “Programming with Data,” mapping the data-engineering lifecycle to the software development lifecycle by using a structured knowledge representation as the shared basis for both training and evaluation.
- In this framework, training data acts like source code, model training corresponds to compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging that targets specific concept gaps and reasoning-chain breaks.
- The authors report that iterative repair cycles yield consistent improvements across different model scales and architectures while preserving general capabilities, and they release open resources including a structured knowledge base, benchmark suite, and training corpus.
- They demonstrate the approach across sixteen disciplines spanning natural sciences, engineering, biomedicine, and social sciences, aiming to make the link between training data and model behavior reliably traceable and systematically fixable.
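The lifecycle mapping in the key points above can be sketched as a toy loop. This is a minimal illustration of the analogy only, not the paper's implementation: all function names, the lookup-table "model," and the sample data are hypothetical, standing in for actual LLM training, benchmarking, and data repair.

```python
# Toy sketch of the "Programming with Data" loop: training data as source
# code, training as compilation, benchmarking as unit testing, and
# failure-driven data repair as debugging. Everything here is a
# hypothetical stand-in for the paper's real pipeline.

def train(corpus):
    """'Compile' the corpus into a model: here, a concept -> answer lookup."""
    return dict(corpus)

def run_benchmark(model, benchmark):
    """'Unit-test' the model and return the failing items."""
    return [(concept, expected) for concept, expected in benchmark
            if model.get(concept) != expected]

def repair(corpus, failures):
    """'Debug' the data: patch an entry for each diagnosed concept gap."""
    patched = dict(corpus)
    for concept, expected in failures:
        patched[concept] = expected  # fill the concept gap in the data
    return list(patched.items())

def improve(corpus, benchmark, max_iters=5):
    """Iterate train -> benchmark -> repair until the tests pass."""
    failures = []
    for _ in range(max_iters):
        model = train(corpus)
        failures = run_benchmark(model, benchmark)
        if not failures:
            break
        corpus = repair(corpus, failures)
    return corpus, failures

# Sample data: one wrong entry ("entropy") and one missing concept ("torque").
corpus = [("osmosis", "passive transport"), ("entropy", "order")]
benchmark = [("osmosis", "passive transport"),
             ("entropy", "disorder"),
             ("torque", "rotational force")]
fixed_corpus, remaining = improve(corpus, benchmark)
```

In this sketch, benchmark failures point directly at the corpus entries responsible, which is the traceability the framework aims for; in the paper, the shared structured knowledge representation plays the role this simple key alignment plays here.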


