Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection
arXiv cs.CL / 4/16/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper re-evaluates SemEval-2020 Task 1 as a benchmark for lexical semantic change detection using a framework focused on operationalisation, data quality, and benchmark design.
- It argues that modeling semantic change primarily as gain/loss/redistribution of discrete senses is too narrow to reflect gradual, constructional, collocational, and discourse-level changes.
- The authors show the dataset suffers from substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatization and POS tagging, and missed target occurrences, which can bias model results and hurt reproducibility.
- They further contend that the benchmark’s small curated target sets and limited language coverage make it less realistic and increase statistical uncertainty (see the sketch after this list), so it should be treated as a partial test bed rather than a definitive measure of progress.
- The paper calls for future datasets and shared tasks to use broader semantic-change theories, publish preprocessing transparently, expand cross-linguistic coverage, and adopt more realistic evaluation settings.
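To make the statistical-uncertainty point concrete, here is a minimal sketch, not taken from the paper: it bootstraps Spearman's ρ (the metric used for the graded-change subtask) over a synthetic set of 37 targets, roughly the size of the English target list, with made-up gold and system scores. With so few items, the confidence interval around a single system's score is typically wide.

```python
# Minimal sketch (synthetic data): uncertainty of Spearman's rho on a small target set.
# The gold and predicted scores below are invented for illustration only.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_targets = 37                                               # small curated target set
gold = rng.normal(size=n_targets)                            # synthetic gold graded-change scores
pred = 0.6 * gold + rng.normal(scale=0.8, size=n_targets)    # a moderately correlated "system"

# Point estimate of the evaluation metric
rho, _ = spearmanr(gold, pred)

# Bootstrap over targets to estimate how wide the uncertainty band is
boot_rhos = []
for _ in range(2000):
    idx = rng.integers(0, n_targets, size=n_targets)         # resample targets with replacement
    r, _ = spearmanr(gold[idx], pred[idx])
    boot_rhos.append(r)

lo, hi = np.percentile(boot_rhos, [2.5, 97.5])
print(f"Spearman rho = {rho:.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

On a run like this the interval easily spans 0.2 or more in ρ, often larger than the gaps separating competing systems, which is why the key point above treats small ranking differences on this benchmark with caution.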