SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

arXiv cs.CL · April 30, 2026


Key Points

  • The paper highlights a new threat to scholarly integrity: adversarial hidden prompts embedded in submissions can manipulate LLM-based academic peer review.
  • It proposes a Generator–Defender adversarial framework in which a Generator model crafts attack prompts and a Defender model learns to detect them.
  • Joint training uses a loss function inspired by Information Retrieval Generative Adversarial Networks (IRGAN), enabling ongoing co-evolution between attacker and detector; an illustrative objective is sketched after this list.
  • The authors report that this dynamic co-evolution yields substantially stronger resilience against novel and evolving adversarial threats than static defenses.
  • The work is positioned as a foundational step toward securing the integrity of LLM-driven peer review systems.
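As a rough illustration of how an IRGAN-style objective could couple the two models, the minimax formulation below adapts the standard recipe to attack-prompt detection. The notation and the exact form of the loss are assumptions made for illustration; neither the summary above nor the abstract reproduces the paper's actual formula.

```latex
% Assumed adaptation of an IRGAN-style minimax objective (not the paper's exact loss).
% D_phi(x): Defender's estimated probability that submission x carries a hidden prompt.
% G_theta:  Generator producing adversarial hidden prompts \hat{x}.
\min_{\theta}\,\max_{\phi}\;
  \mathbb{E}_{x \sim P_{\mathrm{clean}}}\bigl[\log\bigl(1 - D_{\phi}(x)\bigr)\bigr]
  + \mathbb{E}_{\hat{x} \sim G_{\theta}}\bigl[\log D_{\phi}(\hat{x})\bigr]
```

Under this reading, the Defender (the max over φ) learns to score generated attacks high and clean submissions low, while the Generator (the min over θ) is pushed to produce prompts the Defender cannot distinguish from clean text, which is the co-evolution pressure the key points describe.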

Abstract

As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial hidden prompts (instructions embedded in submissions to manipulate review outcomes) emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework in which a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with detecting them. The system is trained with a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models and forces the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly greater resilience to novel and evolving threats than static defenses, thereby establishing a critical foundation for securing the integrity of peer review.
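The abstract gives no implementation details, so the following is only a minimal sketch of what the Generator–Defender co-training loop could look like. Everything in it is an assumption for illustration: the class names, the use of continuous embeddings in place of actual LLM-generated prompt text, and the direct gradient update for the Generator (an IRGAN-inspired setup over discrete prompts would more plausibly use a policy-gradient-style update).

```python
# Illustrative sketch only: a toy Generator-Defender loop over embeddings.
# Class names, dimensions, and losses are assumptions, not the paper's code.
import torch
import torch.nn as nn

class Defender(nn.Module):
    """Scores how likely a submission embedding contains a hidden prompt."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # logit; higher = more likely an attack

class Generator(nn.Module):
    """Stand-in for an LLM that emits embeddings of candidate attack prompts."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, z):
        return self.net(z)

def train_step(gen, det, clean_batch, opt_g, opt_d):
    bce = nn.BCEWithLogitsLoss()
    z = torch.randn(clean_batch.size(0), clean_batch.size(1))

    # Defender step: push clean submissions toward label 0, generated attacks toward 1.
    attacks = gen(z).detach()
    d_loss = (bce(det(clean_batch), torch.zeros(len(clean_batch)))
              + bce(det(attacks), torch.ones(len(attacks))))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: reward attacks the Defender fails to flag (target label 0).
    g_loss = bce(det(gen(z)), torch.zeros(len(z)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

if __name__ == "__main__":
    dim = 128
    gen, det = Generator(dim), Defender(dim)
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(det.parameters(), lr=1e-3)
    for step in range(100):
        clean = torch.randn(32, dim)  # placeholder for real submission embeddings
        d_loss, g_loss = train_step(gen, det, clean, opt_g, opt_d)
    print(f"final defender loss {d_loss:.3f}, generator loss {g_loss:.3f}")
```

A real system would replace the toy modules with an LLM that rewrites submissions to embed hidden instructions and a detector over full submission text, but the alternating update, where each Defender improvement immediately becomes the Generator's new target, is the co-evolution dynamic the paper argues outperforms static defenses.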