Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable

arXiv cs.CL / March 24, 2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The paper examines whether journal and conference policies that allow LLMs only for polishing (paraphrasing/grammar correction) are practically enforceable using current AI-text detectors.
  • Using a dataset of simulated peer reviews with different levels of human–AI collaboration, the authors find that five state-of-the-art detectors (including two commercial systems) frequently misclassify LLM-polished reviews as fully AI-generated.
  • The resulting false positives create a substantial risk of wrongful accusations of academic misconduct when detectors are used to enforce “polishing-only” rules (a minimal illustration follows this list).
  • The study also tests whether peer-review-specific signals, such as access to the paper manuscript and the constrained domain of scientific writing, can improve detection; these signals yield measurable gains in some settings but still fall short of the accuracy needed to reliably identify AI use.
  • The findings caution against relying on detector-based public estimates of how often AI is used in peer review, because mixed human–AI outputs can be misclassified as fully AI-generated, overstating the extent of policy violations.
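
To make the enforcement gap concrete, here is a minimal sketch (my illustration, not code or data from the paper) of how the false-accusation risk can be quantified: calibrate a detector threshold on fully human-written text, then measure how often LLM-polished reviews are still flagged. All scores below are hypothetical placeholder distributions.

```python
import numpy as np

# Hypothetical detector scores in [0, 1]; higher = "more likely AI".
# These arrays are illustrative placeholders, not data from the paper.
rng = np.random.default_rng(0)
scores_human    = rng.beta(2, 8, size=1000)   # fully human-written reviews
scores_polished = rng.beta(5, 5, size=1000)   # human-written, LLM-polished
scores_ai       = rng.beta(8, 2, size=1000)   # fully LLM-generated reviews

# Calibrate the decision threshold so that at most 1% of fully
# human-written reviews would be flagged (a common operating point).
threshold = np.quantile(scores_human, 0.99)

# Under a "polishing is allowed" policy, flagging a polished review
# is a false accusation; measure how often that happens.
fpr_polished = np.mean(scores_polished >= threshold)
tpr_ai = np.mean(scores_ai >= threshold)

print(f"threshold (1% FPR on human text): {threshold:.3f}")
print(f"polished reviews flagged as AI:   {fpr_polished:.1%}")
print(f"fully AI reviews flagged:         {tpr_ai:.1%}")
```

The point of the exercise: if the polished-score distribution overlaps the fully-AI distribution, no threshold can both catch violations and protect reviewers who only used an LLM for allowed polishing.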

Abstract

A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human–AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews derived from AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human–AI outputs) as fully AI-generated, potentially overstating the extent of policy violations.
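
For intuition about what “leveraging peer-review-specific signals” could look like in practice, here is a hedged sketch; this is my own illustration, not the paper's method. It discounts a detector's AI score when the review is lexically grounded in the specific manuscript, on the (assumed) theory that templated AI reviews engage less with the paper's actual content. The function names, the overlap measure, and the weight are all hypothetical.

```python
from collections import Counter
import math
import re

def term_overlap(review: str, manuscript: str) -> float:
    """Cosine similarity over raw word counts: a crude proxy for how
    grounded a review is in the specific manuscript it discusses."""
    tok = lambda s: re.findall(r"[a-z]+", s.lower())
    a, b = Counter(tok(review)), Counter(tok(manuscript))
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def adjusted_ai_score(detector_score: float, review: str,
                      manuscript: str, weight: float = 0.3) -> float:
    """Hypothetical combination: lower the detector's AI score in
    proportion to manuscript grounding. The weight is arbitrary and
    would need calibration on labeled data."""
    grounding = term_overlap(review, manuscript)
    return max(0.0, detector_score - weight * grounding)
```

Even granting such a signal, the paper's conclusion is that each approach it examines has limitations and none reaches the accuracy standard needed to attribute AI use to an individual reviewer.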