Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

arXiv cs.CL · March 26, 2026


Key Points

  • The paper presents a system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) focused on clustering inconsistent software mentions across scientific corpora.
  • It uses a hybrid pipeline combining Sentence-BERT semantic embeddings, FAISS-based KB lookup over training-set cluster centroids, and HDBSCAN density-based clustering for mentions not confidently matched to existing clusters.
  • It improves canonical-name matching via surface-form normalization and abbreviation resolution, and reuses the same core pipeline across CDCR Subtasks 1 and 2.
  • For the large-scale Subtask 3, it introduces a blocking strategy based on entity types and canonicalized surface forms to make clustering more efficient.
  • Reported performance is very high, with CoNLL F1 scores of 0.98, 0.98, and 0.96 for Subtasks 1, 2, and 3, respectively.
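The centroid-lookup step described above can be sketched as follows. This is a minimal illustration, not the authors' code: mention embeddings (e.g. from Sentence-BERT) are compared against per-cluster training-set centroids by cosine similarity, and mentions below a confidence threshold are deferred to density-based clustering. The threshold value and function names here are assumptions; the paper uses FAISS for the nearest-centroid search, which plain NumPy stands in for in this toy-scale sketch.

```python
import numpy as np

def assign_or_defer(mention_vecs, centroids, centroid_ids, threshold=0.8):
    """Assign each mention embedding to its nearest KB centroid by cosine
    similarity; mentions below the threshold are deferred to clustering
    (HDBSCAN in the paper's pipeline).

    mention_vecs: (m, d) array of mention embeddings.
    centroids:    (k, d) array of training-set cluster centroids.
    centroid_ids: list of k cluster identifiers.
    Returns (assignments, deferred): assignments maps mention index to a
    cluster id; deferred lists mention indices left for clustering.
    """
    # L2-normalize so that dot products equal cosine similarities
    m = mention_vecs / np.linalg.norm(mention_vecs, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = m @ c.T                      # (m, k) cosine-similarity matrix
    best = sims.argmax(axis=1)          # nearest centroid per mention
    assignments, deferred = {}, []
    for i, j in enumerate(best):
        if sims[i, j] >= threshold:
            assignments[i] = centroid_ids[j]
        else:
            deferred.append(i)
    return assignments, deferred

# Toy example: two clusters, one ambiguous mention that gets deferred
mentions = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
assigned, deferred = assign_or_defer(mentions, centroids, ["numpy", "torch"])
```

At production scale, the exhaustive similarity matrix is replaced by an approximate-nearest-neighbor index (FAISS), but the assign-or-defer logic is the same.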

Abstract

This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, a Knowledge Base (KB) lookup strategy built from training-set cluster centroids with FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large-scale setting of Subtask 3, the pipeline was adapted with a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3, respectively.
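The normalization and blocking steps in the abstract can be illustrated with a short sketch. This is a hedged approximation, not the paper's implementation: the abbreviation table, the version-stripping rule, and the block key are all illustrative assumptions. The point of blocking is that clustering then runs only within each block, so candidate comparisons scale with block size rather than corpus size.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; the paper's actual resource is not specified.
ABBREVIATIONS = {"tf": "tensorflow", "sklearn": "scikit-learn"}

def canonicalize(surface):
    """Normalize a software surface form: lowercase, expand known
    abbreviations, and strip trailing version numbers and punctuation."""
    s = surface.strip().lower()
    s = ABBREVIATIONS.get(s, s)
    s = re.sub(r"\s+v?\d+(\.\d+)*$", "", s)   # drop trailing version, e.g. " 2.0"
    return s.strip(" .,;")

def build_blocks(mentions):
    """Group mentions into blocks keyed by (entity type, canonical form).
    mentions: list of (surface_form, entity_type) pairs.
    Returns a dict mapping each block key to the mention indices it holds;
    clustering is then run independently inside each block."""
    blocks = defaultdict(list)
    for idx, (surface, etype) in enumerate(mentions):
        blocks[(etype, canonicalize(surface))].append(idx)
    return dict(blocks)

# Toy example: four mentions collapse into two blocks
mentions = [("TensorFlow 2.0", "Application"), ("tf", "Application"),
            ("NumPy", "Application"), ("numpy 1.21", "Application")]
blocks = build_blocks(mentions)
```

Here "TensorFlow 2.0" and "tf" land in the same block, so the expensive embedding-based comparison is only ever run between plausible coreferents.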