Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable

arXiv cs.CL / March 24, 2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The paper examines whether journal and conference policies that allow LLMs only for polishing (paraphrasing/grammar correction) are practically enforceable using current AI-text detectors.
  • Using a dataset of simulated peer reviews with different levels of human–AI collaboration, the authors find that five state-of-the-art detectors (including two commercial systems) frequently misclassify LLM-polished reviews as fully AI-generated.
  • The resulting false positives create a substantial risk of wrongful accusations of academic misconduct when detectors are used to enforce “polishing-only” rules (a minimal illustration follows this list).
  • The study also tests whether peer-review-specific signals, such as access to the paper manuscript and the constrained domain of scientific writing, can improve detection; these signals yield measurable gains in some settings but still fall short of the accuracy needed to reliably identify AI use.
  • The findings caution against relying on detector-based public estimates of how often AI is used in peer review, because mixed human–AI outputs can be misclassified as fully AI-generated, overstating the extent of policy violations.
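
To make the enforcement gap concrete, here is a minimal sketch (my illustration, not code or data from the paper) of how the false-accusation risk can be quantified: calibrate a detector threshold on fully human-written text, then measure how often LLM-polished reviews are still flagged. All scores below are hypothetical placeholder distributions.

```python
import numpy as np

# Hypothetical detector scores in [0, 1]; higher = "more likely AI".
# These arrays are illustrative placeholders, not data from the paper.
rng = np.random.default_rng(0)
scores_human    = rng.beta(2, 8, size=1000)   # fully human-written reviews
scores_polished = rng.beta(5, 5, size=1000)   # human-written, LLM-polished
scores_ai       = rng.beta(8, 2, size=1000)   # fully LLM-generated reviews

# Calibrate the decision threshold so that at most 1% of fully
# human-written reviews would be flagged (a common operating point).
threshold = np.quantile(scores_human, 0.99)

# Under a "polishing is allowed" policy, flagging a polished review
# is a false accusation; measure how often that happens.
fpr_polished = np.mean(scores_polished >= threshold)
tpr_ai = np.mean(scores_ai >= threshold)

print(f"threshold (1% FPR on human text): {threshold:.3f}")
print(f"polished reviews flagged as AI:   {fpr_polished:.1%}")
print(f"fully AI reviews flagged:         {tpr_ai:.1%}")
```

The point of the exercise: if the polished-score distribution overlaps the fully-AI distribution, no threshold can both catch violations and protect reviewers who only used an LLM for allowed polishing.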

Abstract

A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human–AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews derived from AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human–AI outputs) as fully AI-generated, potentially overstating the extent of policy violations.
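
For intuition about what “leveraging peer-review-specific signals” could look like in practice, here is a hedged sketch; this is my own illustration, not the paper's method. It discounts a detector's AI score when the review is lexically grounded in the specific manuscript, on the (assumed) theory that templated AI reviews engage less with the paper's actual content. The function names, the overlap measure, and the weight are all hypothetical.

```python
from collections import Counter
import math
import re

def term_overlap(review: str, manuscript: str) -> float:
    """Cosine similarity over raw word counts: a crude proxy for how
    grounded a review is in the specific manuscript it discusses."""
    tok = lambda s: re.findall(r"[a-z]+", s.lower())
    a, b = Counter(tok(review)), Counter(tok(manuscript))
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def adjusted_ai_score(detector_score: float, review: str,
                      manuscript: str, weight: float = 0.3) -> float:
    """Hypothetical combination: lower the detector's AI score in
    proportion to manuscript grounding. The weight is arbitrary and
    would need calibration on labeled data."""
    grounding = term_overlap(review, manuscript)
    return max(0.0, detector_score - weight * grounding)
```

Even granting such a signal, the paper's conclusion is that each approach it examines has limitations and none reaches the accuracy standard needed to attribute AI use to an individual reviewer.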