Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

arXiv cs.CL / 4/16/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper re-evaluates SemEval-2020 Task 1 as a benchmark for lexical semantic change detection using a framework focused on operationalisation, data quality, and benchmark design.
  • It argues that modeling semantic change primarily as gain/loss/redistribution of discrete senses is too narrow to reflect gradual, constructional, collocational, and discourse-level changes (a minimal sketch of this sense-based operationalisation follows the list).
  • The authors show that the dataset is affected by significant corpus and preprocessing problems, such as OCR noise, malformed characters, truncated sentences, lemmatization and POS-tagging inconsistencies, and missed targets, which can bias model outcomes and reduce reproducibility.
  • They further contend that the benchmark’s small curated target sets and limited language coverage make it less realistic and increase statistical uncertainty, so it should be treated as a partial test bed rather than a definitive measure of progress.
  • The paper calls for future datasets and shared tasks to use broader semantic-change theories, publish preprocessing transparently, expand cross-linguistic coverage, and adopt more realistic evaluation settings.
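
To make the sense-based operationalisation concrete, the sketch below shows how change scores fall out once usages have been annotated and clustered into discrete senses: graded change as the Jensen-Shannon distance between the two sense frequency distributions, and binary change as the gain or loss of a sense subject to frequency thresholds. The sense counts and the thresholds `k` and `n` are hypothetical placeholders, not the task's actual data or parameter values.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical sense-frequency counts for one target word in two time periods,
# as produced by usage annotation and sense clustering (the counts are made up).
senses_t1 = np.array([120, 35, 0])   # sense usage counts in the earlier corpus
senses_t2 = np.array([80, 10, 55])   # sense usage counts in the later corpus

# Normalise counts to probability distributions over senses.
p = senses_t1 / senses_t1.sum()
q = senses_t2 / senses_t2.sum()

# Graded change (in the spirit of Subtask 2): divergence between the two
# sense distributions, here the Jensen-Shannon distance.
graded_change = jensenshannon(p, q)

# Binary change (in the spirit of Subtask 1): has the word gained or lost a
# sense? The thresholds k and n below are illustrative placeholders only.
k, n = 2, 5
gained = np.any((senses_t1 <= k) & (senses_t2 >= n))
lost = np.any((senses_t1 >= n) & (senses_t2 <= k))
binary_change = int(gained or lost)

print(f"graded change: {graded_change:.3f}, binary change: {binary_change}")
```

Everything downstream of the annotation, clustering, and threshold choices is baked into these labels, which is exactly the validity concern the paper raises about the gold standard.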

Abstract

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Moreover, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which may limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of benchmark design, we argue that the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document preprocessing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.
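
As a rough illustration of what the data-quality critique implies in practice, the following sketch shows the kind of corpus audit one could run to surface OCR noise, truncation artefacts, and missed target occurrences. The heuristics, the `audit_corpus` function, and the file name are assumptions made for illustration; they are not the authors' tooling or the benchmark's actual preprocessing pipeline.

```python
import re
from collections import Counter

def audit_corpus(path, target_lemmas):
    """Count simple indicators of corpus quality in a whitespace-tokenised,
    one-sentence-per-line file (illustrative heuristics only)."""
    stats = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if not tokens:
                continue
            stats["sentences"] += 1
            # Crude OCR-noise proxy: tokens containing replacement or control characters.
            stats["noisy_tokens"] += sum(
                1 for t in tokens if re.search(r"[\ufffd\x00-\x08]", t)
            )
            # Very short "sentences" are often truncation artefacts.
            if len(tokens) < 3:
                stats["suspiciously_short"] += 1
            # How often do the lemmatised target words actually appear?
            stats["target_hits"] += sum(1 for t in tokens if t in target_lemmas)
    return stats

# Example usage with hypothetical inputs.
# print(audit_corpus("corpus1_lemma.txt", {"plane", "graft", "tip"}))
```

A low `target_hits` count relative to expectations, or a high share of noisy or truncated lines, is the kind of symptom the paper reports; the point is that such checks should be run and documented before the data is used for evaluation.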