Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

arXiv cs.CL / 4/24/2026


Key Points

  • The paper introduces IRM (Implicit Reward Model), a zero-shot method for detecting text generated by LLMs using implicit reward modeling.
  • IRM can be built from publicly available instruction-tuned and base models, avoiding reliance on specialized, task-specific fine-tuning.
  • Unlike prior reward-based approaches that require preference construction and additional training, IRM does not need preference data collection or further model training.
  • Experiments on the DetectRL benchmark show IRM achieves stronger detection performance, outperforming existing zero-shot and supervised methods for LLM-generated text detection.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based methods rely on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that IRM achieves superior detection performance, outperforming existing zero-shot and supervised methods in LLM-generated text detection.
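
The abstract states that the implicit reward model is derived from a publicly available instruction-tuned model and its base model, but it does not spell out the exact scoring rule. The sketch below is therefore one plausible instantiation, not the paper's implementation: it scores a text by the length-normalized log-likelihood ratio between the instruction-tuned and base models, following the DPO-style definition of an implicit reward, and thresholds that score for detection. The model names and the threshold value are illustrative assumptions.

```python
# Hedged sketch of implicit-reward scoring for LLM-text detection.
# Assumption: the implicit reward of a text x is proportional to
# log p_tuned(x) - log p_base(x), as in the DPO implicit reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TUNED_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed instruction-tuned model
BASE_NAME = "meta-llama/Llama-2-7b-hf"        # assumed matching base model

tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)
tuned = AutoModelForCausalLM.from_pretrained(
    TUNED_NAME, torch_dtype=torch.float16, device_map="auto"
).eval()
base = AutoModelForCausalLM.from_pretrained(
    BASE_NAME, torch_dtype=torch.float16, device_map="auto"
).eval()


@torch.no_grad()
def total_log_likelihood(model, input_ids: torch.Tensor) -> float:
    """Sum of token log-probabilities of input_ids under `model`."""
    input_ids = input_ids.to(model.device)
    logits = model(input_ids).logits[:, :-1, :]   # predict token t+1 from its prefix
    targets = input_ids[:, 1:]
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()


@torch.no_grad()
def implicit_reward(text: str) -> float:
    """Length-normalized log-likelihood ratio between tuned and base models."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    n_scored = max(ids.shape[1] - 1, 1)           # number of predicted tokens
    return (total_log_likelihood(tuned, ids) - total_log_likelihood(base, ids)) / n_scored


def detect(text: str, threshold: float = 0.0) -> bool:
    """Flag text as LLM-generated when its implicit reward exceeds a threshold.
    The threshold here is a placeholder, not a value from the paper."""
    return implicit_reward(text) > threshold
```

In a zero-shot setting such as DetectRL, the raw score can also be used directly (e.g., for AUROC) instead of picking a fixed threshold; whether the paper normalizes by length or conditions on a prompt is not stated in this summary.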
