OpenSanctions Pairs: Large-Scale Entity Matching with LLMs

arXiv cs.AI / 3/13/2026

Key Points

  • The OpenSanctions Pairs dataset covers 755,540 labeled pairs from 293 sources across 31 countries, featuring multilingual and cross-script names, noisy attributes, and set-valued fields typical of compliance workflows.
  • In benchmarking, a production rule-based matcher (Nomenklatura RegressionV1) is outperformed by LLMs in zero- and few-shot settings, with up to 98.95% F1 from GPT-4o (and 98.23% F1 from a locally deployable open model, DeepSeek-R1-Distill-Qwen-14B).
  • DSPy MIPROv2 prompt optimization yields consistent but modest gains; adding in-context examples provides little extra benefit and can degrade performance.
  • Error analysis shows rule-based systems over-match (false positives) while LLMs struggle with cross-script transliteration and minor identifier/date inconsistencies, suggesting a shift toward blocking, clustering, and uncertainty-aware review.
  • The work indicates pairwise matching performance is nearing a practical ceiling, and code for the project is available on GitHub.
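The zero- and few-shot setup above treats matching as a pairwise judgment: two candidate records are serialized into a prompt and the model answers match or no-match. A minimal sketch of that serialization step is below; the field names, record layout, and prompt wording are illustrative assumptions, not the paper's actual template.

```python
import json

def pair_to_prompt(left: dict, right: dict) -> str:
    """Serialize two entity records into a zero-shot matching prompt.

    The wording and JSON layout here are hypothetical stand-ins for
    whatever template the benchmark actually uses.
    """
    return (
        "Do these two records refer to the same real-world entity?\n"
        f"Record A: {json.dumps(left, ensure_ascii=False, sort_keys=True)}\n"
        f"Record B: {json.dumps(right, ensure_ascii=False, sort_keys=True)}\n"
        "Answer with exactly one word: MATCH or NO_MATCH."
    )

# Example pair with multilingual names and partial attributes,
# the kind of noise the dataset is built around.
left = {"name": ["Ivan Petrov", "Иван Петров"], "birthDate": ["1965-04-02"]}
right = {"name": ["Petrov, Ivan"], "country": ["ru"]}
print(pair_to_prompt(left, right))
```

Constraining the model to a one-word answer makes the output trivially parseable, which matters when scoring hundreds of thousands of pairs.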

Abstract

We release OpenSanctions Pairs, a large-scale entity matching benchmark derived from real-world international sanctions aggregation and analyst deduplication. The dataset contains 755,540 labeled pairs spanning 293 heterogeneous sources across 31 countries, with multilingual and cross-script names, noisy and missing attributes, and set-valued fields typical of compliance workflows. We benchmark a production rule-based matcher (nomenklatura RegressionV1 algorithm) against open- and closed-source LLMs in zero- and few-shot settings. Off-the-shelf LLMs substantially outperform the production rule-based baseline (91.33% F1), reaching up to 98.95% F1 (GPT-4o) and 98.23% F1 with a locally deployable open model (DeepSeek-R1-Distill-Qwen-14B). DSPy MIPROv2 prompt optimization yields consistent but modest gains, while adding in-context examples provides little additional benefit and can degrade performance. Error analysis shows complementary failure modes: the rule-based system over-matches (high false positives), whereas LLMs primarily fail on cross-script transliteration and minor identifier/date inconsistencies. These results indicate that pairwise matching performance is approaching a practical ceiling in this setting, and motivate shifting effort toward pipeline components such as blocking, clustering, and uncertainty-aware review. Code available at https://github.com/chansmi/OSINT_entity_resolution
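The abstract's closing point is that effort should shift toward pipeline stages like blocking, which prunes the quadratic space of candidate pairs before any expensive pairwise matcher runs. The sketch below shows the idea with a deliberately crude accent-folding key on the first name token; the key function and example records are illustrative, not the benchmark's actual blocking scheme.

```python
from collections import defaultdict
from itertools import combinations
import unicodedata

def block_key(name: str) -> str:
    # Strip accents via NFKD decomposition, lowercase, keep the first
    # token: a toy blocking key chosen only for illustration.
    folded = unicodedata.normalize("NFKD", name)
    ascii_name = folded.encode("ascii", "ignore").decode().lower()
    tokens = ascii_name.split()
    return tokens[0] if tokens else ""

def candidate_pairs(entities: dict) -> list:
    """Group entity IDs by blocking key; only pairs that share a key
    are forwarded to the (expensive) pairwise matcher."""
    blocks = defaultdict(list)
    for ent_id, name in entities.items():
        blocks[block_key(name)].append(ent_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

ents = {"e1": "José García", "e2": "Jose Garcia", "e3": "Anna Müller"}
print(candidate_pairs(ents))  # → [('e1', 'e2')]
```

A key this naive would miss cross-script pairs entirely (the LLM failure mode the error analysis highlights), which is exactly why blocking quality becomes the bottleneck once pairwise accuracy nears its ceiling.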