Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution
arXiv cs.CL / 3/26/2026
📰 News | Ideas & Deep Analysis | Models & Research
Key Points
- The paper presents a system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) focused on clustering inconsistent software mentions across scientific corpora.
- It uses a hybrid pipeline combining Sentence-BERT semantic embeddings, FAISS-based KB lookup over training-set cluster centroids, and HDBSCAN density-based clustering for mentions not confidently matched to existing clusters.
- The approach improves canonicalization via surface-form normalization and abbreviation resolution, and reuses the same core pipeline across CDCR Subtasks 1 and 2.
- For the large-scale Subtask 3, it introduces a blocking strategy based on entity types and canonicalized surface forms to make clustering more efficient.
- The system reports CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3, respectively.
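
The pipeline described in the key points can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. To stay runnable without external libraries, it swaps in character-trigram count vectors for the Sentence-BERT embeddings and a brute-force cosine search for the FAISS lookup, and it leaves out the HDBSCAN step: mentions that fall below the similarity threshold are only grouped by a blocking key. All function names, the 0.6 threshold, the abbreviation table, and the entity-type label are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def canonicalize(mention, abbreviations=None):
    """Normalize a surface form: lowercase, drop version strings and
    punctuation, then expand a known abbreviation (hypothetical table)."""
    abbreviations = abbreviations or {}
    text = mention.lower()
    text = re.sub(r"\bv?\d+(\.\d+)*\b", " ", text)   # e.g. "v22.0", "0.24"
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()  # punctuation -> space
    return abbreviations.get(text, text)


def embed(text):
    """Stand-in for a sentence embedding: character-trigram counts."""
    padded = f"  {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_centroids(training):
    """training: iterable of (mention, cluster_id) pairs.
    Summing vectors is enough here because cosine ignores scale."""
    centroids = defaultdict(Counter)
    for mention, cluster_id in training:
        centroids[cluster_id].update(embed(canonicalize(mention)))
    return dict(centroids)


def assign(mention, centroids, threshold=0.6):
    """Return the best-matching known cluster, or None when no centroid
    is similar enough (such mentions would go on to density clustering)."""
    vec = embed(canonicalize(mention))
    best_id, best_sim = None, 0.0
    for cluster_id, centroid in centroids.items():
        sim = cosine(vec, centroid)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id if best_sim >= threshold else None


def group_by_block(mentions):
    """Blocking: bucket (mention, entity_type) pairs by entity type and
    canonical form so clustering only compares mentions within a block."""
    blocks = defaultdict(list)
    for mention, entity_type in mentions:
        blocks[(entity_type, canonicalize(mention))].append(mention)
    return dict(blocks)


training = [
    ("SPSS", "spss"),
    ("SPSS v22", "spss"),
    ("IBM SPSS Statistics", "spss"),
    ("scikit-learn", "sklearn"),
    ("Scikit-Learn 0.24", "sklearn"),
]
centroids = build_centroids(training)
matched = assign("SPSS 25", centroids)        # close to the "spss" centroid
unmatched = assign("TensorFlow", centroids)   # no centroid is similar enough
blocks = group_by_block([("TensorFlow 2.0", "Application"),
                         ("tensorflow", "Application")])
```

In this toy setup the threshold decides whether a mention joins an existing cluster or is deferred, which mirrors the "not confidently matched" split in the summary. A real system would tune that threshold on held-out data and would cluster the deferred mentions rather than only bucketing them.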