Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking

arXiv cs.CL / 4/23/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces MSR-MEL, an unsupervised Multimodal Entity Linking framework that uses LLM-based multi-perspective evidence synthesis and reasoning rather than only optimizing instance-centric signals.
  • It uses a two-stage design: offline evidence synthesis builds multiple evidence types (instance-centric multimodal, group-level graph-aggregated neighborhood, lexical overlap, and statistical summaries) with LLM-enhanced contextualized graphs.
  • For group-level evidence, the method constructs LLM-enhanced graphs and aligns modalities via an asymmetric teacher-student graph neural network to capture interdependencies among neighborhood information.
  • In the online stage, an LLM acts as a reasoning module to analyze correlations and semantics across evidence types, producing an effective ranking strategy for entity linking without supervision.
  • Experiments on common MEL benchmarks show MSR-MEL consistently outperforms existing state-of-the-art unsupervised methods, and the authors provide source code.

Abstract

Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that the human expert decision-making process relies on multi-perspective judgment, in this work we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information via graphs. We first construct LLM-enhanced contextualized graphs; subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages an LLM as a reasoning module to analyze the correlations and semantics of the multi-perspective evidence, inducing an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.
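The online stage can be pictured as packing the synthesized evidence into a single prompt for the LLM reasoning module. A minimal sketch, assuming a flat per-candidate evidence dictionary and an illustrative prompt template (the wording, evidence keys, and scores below are invented for the example; the paper's actual template and the LLM call itself are omitted):

```python
import json

def build_reasoning_prompt(mention: str, evidence_by_candidate: dict) -> str:
    """Assemble multi-perspective evidence into one prompt for the LLM
    reasoning module. The layout is an illustrative assumption, not the
    authors' actual template."""
    lines = [
        f"Mention: {mention}",
        "Evidence per candidate entity:",
    ]
    for candidate, evidence in evidence_by_candidate.items():
        # Serialize each candidate's evidence types on one line.
        lines.append(f"- {candidate}: {json.dumps(evidence)}")
    lines.append(
        "Rank the candidates from most to least likely referent, "
        "reasoning jointly over all evidence types."
    )
    return "\n".join(lines)

# Hypothetical evidence scores for two candidates of the mention "Paris".
evidence = {
    "Paris (city)": {"instance": 0.82, "group": 0.74, "lexical": 0.59,
                     "stats": {"mean": 0.72}},
    "Paris Hilton": {"instance": 0.41, "group": 0.22, "lexical": 0.45,
                     "stats": {"mean": 0.36}},
}
prompt = build_reasoning_prompt("Paris", evidence)
```

The returned string would then be sent to the LLM, whose ranked answer is parsed to select the linked entity; because all supervision-free signals are already encoded in the prompt, no labeled training data is needed at this step.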