Unlocking the Power of Large Language Models for Multi-table Entity Matching

arXiv cs.CL / 4/24/2026

📰 News · Models & Research

Key Points

  • The paper introduces LLM4MEM, an LLM-based framework for multi-table entity matching that links equivalent entities across multiple sources without relying on unique identifiers.
  • It addresses semantic inconsistencies from numerical attribute variations using a multi-style prompt-enhanced attribute coordination module.
  • To keep matching efficient as the number of candidate entities grows across sources, it uses a transitive consensus embedding matching module for entity embedding and pre-matching.
  • It also mitigates errors from noisy entities via a density-aware pruning module that improves the quality of the matching results.
  • Experiments on six MEM datasets show an average 5.1% F1 improvement over the baseline model, and the authors provide code on GitHub.
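The transitive consensus embedding matching module is described only at a high level here. As an illustration of the kind of pre-matching it targets, the following is a hedged pure-Python sketch, not the authors' implementation: all names and the similarity threshold are assumptions. It keeps embedding pairs above a cosine-similarity threshold and closes the matches transitively with union-find, so that entities matched across several tables end up in one cluster.

```python
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def transitive_prematch(embeddings, threshold=0.9):
    """Group entity ids whose embeddings are similar, closing groups
    transitively: if a~b and b~c, then a, b, c share one cluster."""
    ids = list(embeddings)
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    for a, b in combinations(ids, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            union(a, b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

In a real pipeline the vectors would come from an LLM or encoder, and all-pairs comparison would be replaced by blocking or approximate nearest-neighbor search; the clustering logic stays the same.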

Abstract

Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in entity count across multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on six MEM datasets, and the results show that our model improves F1 by an average of 5.1% compared with the baseline model. Our code is available at https://github.com/Ymeki/LLM4MEM.
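The density-aware pruning step is likewise only summarized above. One simple reading of "density-aware" is pruning cluster members that sit in a sparse similarity region; the sketch below is an assumption-laden illustration of that idea (the function name, the average-similarity criterion, and the `min_avg_sim` threshold are all hypothetical, not taken from the paper):

```python
def density_prune(cluster, sim, min_avg_sim=0.5):
    """Drop noisy members from a matched cluster: an entity whose
    average similarity to the rest of the cluster is low sits in a
    sparse (low-density) region and is pruned."""
    if len(cluster) < 2:
        return list(cluster)
    kept = []
    for e in cluster:
        others = [sim(e, o) for o in cluster if o != e]
        if sum(others) / len(others) >= min_avg_sim:
            kept.append(e)
    return kept
```

For example, with similarities sim(a, b) = 0.95 but sim(a, n) = 0.2 and sim(b, n) = 0.1, entity `n` averages well below the threshold and is removed, while `a` and `b` survive as a high-quality match.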