A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management

arXiv cs.LG / 3/31/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper studies entity resolution in master data management by modeling entity–attribute relationships as typed bipartite (and optionally entity–entity) graphs processed with MPNNs.
  • It develops a four-theorem separation framework using two logical predicates (Dup_r for detecting at least r shared attribute values, and Cyc_ell for detecting ℓ-cycles) to ask for the cheapest provably sufficient GNN architecture.
  • The key result is a tight expressivity/complexity gap: detecting any shared attribute is local and can be done with a minimal two-layer MPNN using reverse message passing, while detecting multiple shared attributes requires non-local cross-attribute correlation.
  • For Dup_r (with r>1) and related cycle detection, the paper shows that ego IDs and a four-layer depth are fundamentally necessary, even on acyclic bipartite graphs, and that simpler MPNN variants cannot compute these predicates on all inputs.
  • It also provides constructive minimal-depth MPNN architectures and reports computational validation matching the theoretical predictions, enabling a “minimal-architecture principle” for practitioners.

Abstract

Entity resolution -- identifying database records that refer to the same real-world entity -- is naturally modelled on bipartite graphs connecting entity nodes to their attribute values. Applying a message-passing neural network (MPNN) with all available extensions (reverse message passing, port numbering, ego IDs) incurs unnecessary overhead, since different entity resolution tasks have fundamentally different complexity. For a given matching criterion, what is the cheapest MPNN architecture that provably works? We answer this with a four-theorem separation theory on typed entity-attribute graphs. We introduce co-reference predicates \mathrm{Dup}_r (two same-type entities share at least r attribute values) and the \ell-cycle predicate \mathrm{Cyc}_\ell for settings with entity-entity edges. For each predicate we prove tight bounds -- constructing graph pairs provably indistinguishable by every MPNN lacking the required adaptation, and exhibiting explicit minimal-depth MPNNs that compute the predicate on all inputs. The central finding is a sharp complexity gap between detecting any shared attribute and detecting multiple shared attributes. The former is purely local, requiring only reverse message passing in two layers. The latter demands cross-attribute identity correlation -- verifying that the same entity appears at several attributes of the target -- a fundamentally non-local requirement needing ego IDs and four layers, even on acyclic bipartite graphs. A similar necessity holds for cycle detection. Together, these results yield a minimal-architecture principle: practitioners can select the cheapest sufficient adaptation set, with a guarantee that no simpler architecture works. Computational validation confirms every prediction.