Finding Duplicates in 1.1M BDD Steps: cukereuse, a Paraphrase-Robust Static Detector for Cucumber and Gherkin

arXiv cs.CL / April 23, 2026


Key Points

  • The paper introduces cukereuse, an open-source, purely static (no test execution) Python CLI to detect duplicate Cucumber/Gherkin steps in any repository while being robust to paraphrases.
  • cukereuse uses a layered approach combining exact hashing, Levenshtein ratio, and sentence-transformer embeddings, and the release includes a large corpus (347 GitHub repos, 23,667 .feature files, 1,113,616 Gherkin steps).
  • The authors quantify duplication prevalence, finding a step-weighted exact-duplicate rate of 80.2% and a median per-repository exact-duplicate rate of 58.6%, with the largest hybrid cluster containing 20.7k occurrences across 2.2k files.
  • Evaluation against 1,020 manually labeled step pairs (high inter-annotator agreement, Fleiss’ kappa = 0.84) reports strong pair-level performance: near-exact F1 of about 0.822 on score-free labels, and primary-rubric semantic F1 of about 0.906, which the authors flag as inflated by a stratification artefact that pins recall at 1.000.
  • Beyond detection, the paper provides a critique of Gherkin structured around the Cognitive Dimensions of Notations (CDN) framework, rating eight of fourteen dimensions as problematic or unsupported, and releases the tool, corpus, labels, rubric, and pipeline under permissive licenses.
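The layered idea described above, exact hashing first, then an edit-distance ratio, then embeddings, can be sketched in a few lines of stdlib Python. This is an illustrative reconstruction, not cukereuse's actual code: the `normalize` rules and the 0.9 threshold are assumptions, and `difflib.SequenceMatcher` stands in for a true Levenshtein ratio.

```python
# Hedged sketch of a layered duplicate-step detector in the spirit of
# cukereuse. Names, normalization rules, and thresholds are illustrative
# assumptions, not the paper's implementation.
import hashlib
import re
from difflib import SequenceMatcher

def normalize(step: str) -> str:
    """Lowercase, strip the leading Gherkin keyword, collapse whitespace."""
    step = re.sub(r"^\s*(given|when|then|and|but)\s+", "", step.strip(),
                  flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", step.lower())

def exact_key(step: str) -> str:
    """Layer 1: hash of the normalized text groups exact duplicates."""
    return hashlib.sha256(normalize(step).encode()).hexdigest()

def near_exact(a: str, b: str, threshold: float = 0.9) -> bool:
    """Layer 2: similarity ratio on normalized text (difflib here as a
    stdlib stand-in for a Levenshtein ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# Layer 3 (semantic) would embed each step with a sentence-transformer
# model and compare cosine similarity; omitted here to stay dependency-free.

steps = [
    "Given the user is logged in",
    "given the user is logged in",   # exact duplicate after normalization
    "Given the user has logged in",  # near-exact paraphrase
    "When the cart is empty",
]

clusters: dict = {}
for s in steps:
    clusters.setdefault(exact_key(s), []).append(s)

exact_dupes = [group for group in clusters.values() if len(group) > 1]
```

Running layers in this order is cheap-first: hashing is O(n), the pairwise ratio only needs to run within candidate buckets, and the embedding model is reserved for pairs the lexical layers cannot resolve.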

Abstract

Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose maintenance cost is established in prior work. Existing detection techniques require running the tests (Binamungu et al., 2018-2023) or are confined to a single organisation (Irshad et al., 2020-2022), leaving a gap: a purely static, paraphrase-robust, step-level detector usable on any repository. We fill the gap with cukereuse, an open-source Python CLI combining exact hashing, Levenshtein ratio, and sentence-transformer embeddings in a layered pipeline, released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2 %; the median-repository rate is 58.6 % (Spearman rho = 0.51 with size). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter-annotator Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score-free second-pass relabelling. The strongest honest pair-level number is near-exact at F1 = 0.822 on score-free labels; the primary-rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC-style, NiCad-style) reach primary F1 = 0.761 and 0.799. The paper also presents a CDN-structured critique of Gherkin (Cognitive Dimensions of Notations); eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.
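The abstract reports precision, recall, and F1 with bootstrap 95% confidence intervals over the labelled pairs. A percentile bootstrap for pair-level F1 can be sketched as follows; the data here are synthetic and the resampling count is an arbitrary choice, not the paper's protocol.

```python
# Hedged sketch of a percentile-bootstrap 95% CI for pair-level F1.
# Labels below are synthetic; n_boot and the seed are illustrative.
import random

def f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_f1_ci(gold, pred, n_boot=2000, alpha=0.05, seed=0):
    """Resample labelled pairs with replacement; take percentile bounds."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return f1(gold, pred), (lo, hi)

# Synthetic labelled pairs: 1 = duplicate. 50 TP, 10 FN, 5 FP, 35 TN.
gold = [1] * 60 + [0] * 40
pred = [1] * 50 + [0] * 10 + [0] * 35 + [1] * 5
point, (lo, hi) = bootstrap_f1_ci(gold, pred)
```

Resampling whole labelled pairs (rather than individual predictions) preserves the coupling between gold label and prediction, which is what makes the interval a statement about the evaluation set's sampling variability.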