Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

arXiv cs.CL · March 26, 2026


Key Points

  • The paper presents a system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) focused on clustering inconsistent software mentions across scientific corpora.
  • It uses a hybrid pipeline combining Sentence-BERT semantic embeddings, FAISS-based KB lookup over training-set cluster centroids, and HDBSCAN density-based clustering for mentions not confidently matched to existing clusters.
  • It improves canonical-name matching via surface-form normalization and abbreviation resolution, and reuses the same core pipeline across CDCR Subtasks 1 and 2.
  • For the large-scale Subtask 3, it introduces a blocking strategy based on entity types and canonicalized surface forms to make clustering more efficient.
  • Reported performance is very high, with CoNLL F1 scores of 0.98, 0.98, and 0.96 for Subtasks 1, 2, and 3, respectively.
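The centroid-lookup step described above can be sketched as follows. This is a minimal illustration, not the authors' code: mention embeddings (e.g. from Sentence-BERT) are compared against per-cluster training-set centroids by cosine similarity, and mentions below a confidence threshold are deferred to density-based clustering. The threshold value and function names here are assumptions; the paper uses FAISS for the nearest-centroid search, which plain NumPy stands in for in this toy-scale sketch.

```python
import numpy as np

def assign_or_defer(mention_vecs, centroids, centroid_ids, threshold=0.8):
    """Assign each mention embedding to its nearest KB centroid by cosine
    similarity; mentions below the threshold are deferred to clustering
    (HDBSCAN in the paper's pipeline).

    mention_vecs: (m, d) array of mention embeddings.
    centroids:    (k, d) array of training-set cluster centroids.
    centroid_ids: list of k cluster identifiers.
    Returns (assignments, deferred): assignments maps mention index to a
    cluster id; deferred lists mention indices left for clustering.
    """
    # L2-normalize so that dot products equal cosine similarities
    m = mention_vecs / np.linalg.norm(mention_vecs, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = m @ c.T                      # (m, k) cosine-similarity matrix
    best = sims.argmax(axis=1)          # nearest centroid per mention
    assignments, deferred = {}, []
    for i, j in enumerate(best):
        if sims[i, j] >= threshold:
            assignments[i] = centroid_ids[j]
        else:
            deferred.append(i)
    return assignments, deferred

# Toy example: two clusters, one ambiguous mention that gets deferred
mentions = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
assigned, deferred = assign_or_defer(mentions, centroids, ["numpy", "torch"])
```

At production scale, the exhaustive similarity matrix is replaced by an approximate-nearest-neighbor index (FAISS), but the assign-or-defer logic is the same.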

Abstract

This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, a Knowledge Base (KB) lookup strategy built from training-set cluster centroids with FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large-scale setting of Subtask 3, the pipeline was adapted with a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3, respectively.
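The normalization and blocking steps in the abstract can be illustrated with a short sketch. This is a hedged approximation, not the paper's implementation: the abbreviation table, the version-stripping rule, and the block key are all illustrative assumptions. The point of blocking is that clustering then runs only within each block, so candidate comparisons scale with block size rather than corpus size.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; the paper's actual resource is not specified.
ABBREVIATIONS = {"tf": "tensorflow", "sklearn": "scikit-learn"}

def canonicalize(surface):
    """Normalize a software surface form: lowercase, expand known
    abbreviations, and strip trailing version numbers and punctuation."""
    s = surface.strip().lower()
    s = ABBREVIATIONS.get(s, s)
    s = re.sub(r"\s+v?\d+(\.\d+)*$", "", s)   # drop trailing version, e.g. " 2.0"
    return s.strip(" .,;")

def build_blocks(mentions):
    """Group mentions into blocks keyed by (entity type, canonical form).
    mentions: list of (surface_form, entity_type) pairs.
    Returns a dict mapping each block key to the mention indices it holds;
    clustering is then run independently inside each block."""
    blocks = defaultdict(list)
    for idx, (surface, etype) in enumerate(mentions):
        blocks[(etype, canonicalize(surface))].append(idx)
    return dict(blocks)

# Toy example: four mentions collapse into two blocks
mentions = [("TensorFlow 2.0", "Application"), ("tf", "Application"),
            ("NumPy", "Application"), ("numpy 1.21", "Application")]
blocks = build_blocks(mentions)
```

Here "TensorFlow 2.0" and "tf" land in the same block, so the expensive embedding-based comparison is only ever run between plausible coreferents.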