Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

arXiv cs.CL / April 30, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that common evaluation practices in NLP—especially with the rise of large language models—raise significant methodological concerns that merit re-examination.
  • It performs a scoping review of prior NLP research and proposes a taxonomy that organizes recurring evaluation concerns and the trade-offs associated with them.
  • The taxonomy is intended to help unify ongoing debates by situating current criticisms within the field’s long history of methodological reflection on evaluation.
  • The authors derive practical implications, including a structured checklist designed to improve how evaluation is planned, executed, and interpreted.
  • Overall, the work offers a consolidated reference to support more deliberate and defensible evaluation design in natural language processing.

Abstract

Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.
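The abstract does not spell out the checklist’s contents, but to make the idea of a structured, phase-aware evaluation checklist concrete, here is a minimal sketch of how one might be encoded. Everything here is an illustrative assumption rather than the paper’s actual checklist: the names (`Phase`, `ChecklistItem`, `unaddressed`), the design/execution/interpretation split, and all item texts are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """Stages of an evaluation, mirroring the plan/execute/interpret framing."""
    DESIGN = "design"
    EXECUTION = "execution"
    INTERPRETATION = "interpretation"


@dataclass
class ChecklistItem:
    """One question an evaluator should answer explicitly before moving on."""
    phase: Phase
    question: str
    rationale: str
    answer: str | None = None  # filled in during the evaluation review

    @property
    def addressed(self) -> bool:
        return self.answer is not None and self.answer.strip() != ""


# Illustrative items only; the paper's actual checklist may differ.
CHECKLIST: list[ChecklistItem] = [
    ChecklistItem(Phase.DESIGN,
                  "What construct is the benchmark meant to measure?",
                  "Guards against validity gaps between task and claim."),
    ChecklistItem(Phase.EXECUTION,
                  "Are prompts, decoding settings, and seeds reported?",
                  "Supports reproducibility across runs and labs."),
    ChecklistItem(Phase.INTERPRETATION,
                  "Are score differences tested for statistical significance?",
                  "Prevents over-reading noise as model improvement."),
]


def unaddressed(items: list[ChecklistItem]) -> list[ChecklistItem]:
    """Return items still lacking an explicit answer."""
    return [item for item in items if not item.addressed]


if __name__ == "__main__":
    CHECKLIST[0].answer = "Instruction-following under distribution shift."
    for item in unaddressed(CHECKLIST):
        print(f"[{item.phase.value}] {item.question}")
```

Encoding each concern as a phase-tagged question makes it auditable which concerns were consciously addressed before results are reported, which is the kind of deliberate evaluation design the paper advocates.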