Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

arXiv cs.CL / 4/9/2026


Key Points

  • The paper investigates language-of-study (LoS) bias in NLP peer review, where reviewer judgments may shift based on the languages studied rather than scientific merit.
  • It presents the first systematic characterization of LoS bias, separating negative vs. positive forms and showing that non-English papers experience substantially higher bias rates than English-only papers.
  • Analyzing 15,645 reviews, the study finds that negative bias consistently outweighs positive bias, and that the dominant subtype is the demand for unjustified cross-lingual generalization.
  • The authors introduce the human-annotated dataset LOBSTER and a detection method that achieves 87.37 macro F1, aiming to enable more reliable identification of this bias.
  • All resources are publicly released to support fairer reviewing practices in NLP and potentially other fields.
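For context on the 87.37 macro F1 figure cited above: macro F1 averages the per-class F1 scores with equal weight, so a rare class (e.g. biased reviews) counts as much as the common one. A minimal sketch with hypothetical binary labels (not the paper's actual data or detection method):

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical review labels: 1 = LoS-biased, 0 = unbiased.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]
print(round(macro_f1(y_true, y_pred), 4))  # → 0.6667
```

Because each class contributes equally regardless of its frequency, macro F1 is a common choice when the positive class (here, biased reviews) is rare.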

Abstract

Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.