Automatic End-to-End Data Integration using Large Language Models
arXiv cs.CL / 3/12/2026
📰 News · Tools & Practical Usage · Models & Research
Key Points
- The paper presents an automatic end-to-end data integration pipeline in which GPT-5.2 generates all of the artifacts needed to adapt the pipeline to a specific use case.
- The generated artifacts include schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict-resolution heuristics in data fusion (a minimal sketch of schema-mapping generation follows this list).
- In three case studies (video game, music, and company data), the LLM-based pipeline matches or outperforms human-designed pipelines and produces integrated datasets of comparable size and density.
- Configuring the pipeline costs roughly $10 per case study, far cheaper than hiring a human data engineer.
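
To make the artifact-generation step concrete, here is a minimal sketch of how an LLM might be prompted to produce one such artifact, a schema mapping, and how that mapping could then be applied to source records. The schemas, prompt wording, model name, and use of the OpenAI client are assumptions for illustration; the paper's actual prompts and tooling may differ.

```python
# Hypothetical sketch of generating one artifact (a schema mapping) with an LLM.
# The prompt wording, schemas, model name, and helpers below are illustrative
# assumptions, not the paper's actual prompts or code.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy source/target attribute lists; in the paper these would come from the
# heterogeneous sources of a case study (e.g. video game or music data).
SOURCE_SCHEMA = ["title", "publisher", "release", "platform_name"]
TARGET_SCHEMA = ["name", "publisher", "release_date", "platform"]

PROMPT = """You are configuring a data integration pipeline.
Map each source attribute to the matching target attribute, or to null
if there is no match. Answer with a JSON object only.

Source attributes: {source}
Target attributes: {target}"""


def generate_schema_mapping(source: list[str], target: list[str]) -> dict:
    """Ask the LLM for a source->target attribute mapping and validate it."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in model name; substitute whichever model you use
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, target=target)}],
    )
    # Assumes the model returns bare JSON; a robust version would strip fences.
    mapping = json.loads(response.choices[0].message.content)
    # Keep only pairs that reference real attributes on both sides.
    return {s: t for s, t in mapping.items() if s in source and t in target}


def apply_mapping(record: dict, mapping: dict) -> dict:
    """Translate one source record into the target schema."""
    return {mapping[k]: v for k, v in record.items() if mapping.get(k)}


if __name__ == "__main__":
    mapping = generate_schema_mapping(SOURCE_SCHEMA, TARGET_SCHEMA)
    print(apply_mapping({"title": "Portal 2", "release": "2011-04-18"}, mapping))
```

The other artifacts mentioned in the paper (value mappings, entity-matching training data, validation data for conflict resolution) would be generated analogously, by prompting the model with samples from the sources and validating its output before plugging it into the corresponding pipeline stage.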
