A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management

arXiv cs.LG / 3/31/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper studies entity resolution in master data management by modeling entity–attribute relationships as typed bipartite (and optionally entity–entity) graphs processed with MPNNs.
It develops a four-theorem separation framework using two logical predicates (Dup_r for detecting at least r shared attribute values, and Cyc_ell for detecting ℓ-cycles) to ask for the cheapest provably sufficient GNN architecture.
The key result is a tight expressivity/complexity gap: detecting any shared attribute is local and can be done with a minimal two-layer MPNN using reverse message passing, while detecting multiple shared attributes requires non-local cross-attribute correlation.
For Dup_r (with r>1) and related cycle detection, the paper shows that ego IDs and a four-layer depth are fundamentally necessary, even on acyclic bipartite graphs, and that simpler MPNN variants cannot compute these predicates on all inputs.
It also provides constructive minimal-depth MPNN architectures and reports computational validation matching the theoretical predictions, enabling a “minimal-architecture principle” for practitioners.

Abstract

Entity resolution -- identifying database records that refer to the same real-world entity -- is naturally modelled on bipartite graphs connecting entity nodes to their attribute values. Applying a message-passing neural network (MPNN) with all available extensions (reverse message passing, port numbering, ego IDs) incurs unnecessary overhead, since different entity resolution tasks have fundamentally different complexity. For a given matching criterion, what is the cheapest MPNN architecture that provably works? We answer this with a four-theorem separation theory on typed entity-attribute graphs. We introduce co-reference predicates

\mathrm{Dup}_r

(two same-type entities share at least

r

attribute values) and the

\ell

-cycle predicate

\mathrm{Cyc}_\ell

for settings with entity-entity edges. For each predicate we prove tight bounds -- constructing graph pairs provably indistinguishable by every MPNN lacking the required adaptation, and exhibiting explicit minimal-depth MPNNs that compute the predicate on all inputs. The central finding is a sharp complexity gap between detecting any shared attribute and detecting multiple shared attributes. The former is purely local, requiring only reverse message passing in two layers. The latter demands cross-attribute identity correlation -- verifying that the same entity appears at several attributes of the target -- a fundamentally non-local requirement needing ego IDs and four layers, even on acyclic bipartite graphs. A similar necessity holds for cycle detection. Together, these results yield a minimal-architecture principle: practitioners can select the cheapest sufficient adaptation set, with a guarantee that no simpler architecture works. Computational validation confirms every prediction.

Black Hat Asia

AI Business

[D] How does distributed proof of work computing handle the coordination needs of neural network training?

Reddit r/MachineLearning

Claude Code's Entire Source Code Was Just Leaked via npm Source Maps — Here's What's Inside

Dev.to

BYOK is not just a pricing model: why it changes AI product trust

Dev.to

AI Citation Registries and Identity Persistence Across Records

Dev.to

A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management

Key Points

Abstract

Related Articles

Black Hat Asia

[D] How does distributed proof of work computing handle the coordination needs of neural network training?

Claude Code's Entire Source Code Was Just Leaked via npm Source Maps — Here's What's Inside

BYOK is not just a pricing model: why it changes AI product trust

AI Citation Registries and Identity Persistence Across Records

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer