A Training-free Method for LLM Text Attribution

arXiv stat.ML / 2026/3/24

Key Points

The paper introduces a training-free (zero-shot) approach for attributing LLM-generated text to a specific set of known LLMs while guaranteeing a low false-positive rate.
It models LLM text as a sequential stochastic process with complete dependence on history, and develops statistical tests that (i) distinguish non-sanctioned (external) from in-house (sanctioned) LLM sources and (ii) detect whether text came from a known model or an unknown one.
The authors prove that both Type I (false positive) and Type II (false negative) error rates decrease exponentially as the text length increases, and they provide information-theoretic lower bounds to show the results are tight.
They extend the method to black-box access via sampling, characterizing sample size requirements that achieve error bounds comparable to the white-box setting.
Experiments, including evaluation under adversarial post-editing, support the theoretical claims, and the authors release code, data, and an online demo for practical use in provenance verification, misinformation mitigation, and AI compliance.
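
The idea of testing between two known sources, with errors that shrink as the text grows, can be illustrated with a small simulation. The sketch below is not the authors' test: it assumes a generic per-token log-likelihood-ratio rule, and it replaces real LLMs with two hypothetical first-order Markov chains (`MODEL_A`, `MODEL_B`) over a three-token vocabulary. All names, probabilities, and the threshold are illustrative assumptions.

```python
import math
import random

# Hypothetical toy "models": first-order Markov chains over a 3-token vocabulary.
# Real LLMs condition on the full history; a bigram chain is only a stand-in.
VOCAB = ["a", "b", "c"]
MODEL_A = {  # stand-in for a non-sanctioned model
    "a": [0.6, 0.3, 0.1],
    "b": [0.2, 0.5, 0.3],
    "c": [0.3, 0.3, 0.4],
}
MODEL_B = {  # stand-in for an in-house model
    "a": [0.2, 0.4, 0.4],
    "b": [0.4, 0.2, 0.4],
    "c": [0.5, 0.3, 0.2],
}


def sample_text(model, length, rng, start="a"):
    """Generate `length` tokens after a fixed start token."""
    tokens = [start]
    for _ in range(length):
        tokens.append(rng.choices(VOCAB, weights=model[tokens[-1]])[0])
    return tokens


def log_likelihood(model, tokens):
    """Sum of log P(token | previous token) under the given toy model."""
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log(model[prev][VOCAB.index(cur)])
    return total


def llr_per_token(tokens):
    """Average log-likelihood ratio of model A versus model B."""
    n = len(tokens) - 1
    return (log_likelihood(MODEL_A, tokens) - log_likelihood(MODEL_B, tokens)) / n


def attribute(tokens, threshold=0.0):
    """Attribute to A when the average ratio exceeds the (assumed) threshold."""
    return "A" if llr_per_token(tokens) > threshold else "B"


def error_rate(model, true_label, length, trials=500, seed=0):
    """Empirical fraction of misattributed texts of a given length."""
    rng = random.Random(seed)
    wrong = sum(
        attribute(sample_text(model, length, rng)) != true_label
        for _ in range(trials)
    )
    return wrong / trials


if __name__ == "__main__":
    for length in (5, 20, 80, 200):
        print(length, error_rate(MODEL_A, "A", length), error_rate(MODEL_B, "B", length))
```

Running the script prints empirical misattribution rates for several text lengths; in this toy setting they fall quickly as the length increases, which is the qualitative behavior the key points describe. Raising `threshold` trades one error type for the other.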

Abstract

Verifying the provenance of content is crucial to the functioning of many organizations, e.g., educational institutions, social media platforms, and firms. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions use in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within their institutions. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM, while ensuring a guaranteed low false positive rate? We model LLM text as a sequential stochastic process with complete dependence on history. We then design zero-shot statistical tests to (i) distinguish between text generated by two different known sets of LLMs A (non-sanctioned) and B (in-house), and (ii) identify whether text was generated by a known LLM or by any unknown model. We prove that the Type I and Type II errors of our test decrease exponentially with the length of the text. We also extend our theory to black-box access via sampling and characterize the required sample size to obtain essentially the same Type I and Type II error upper bounds as in the white-box setting (i.e., with access to A). We show the tightness of our upper bounds by providing an information-theoretic lower bound. We next present numerical experiments to validate our theoretical results and assess their robustness in settings with adversarial post-editing. Our work has a host of practical applications in which determining the origin of a text is important and can also be useful for combating misinformation and ensuring compliance with emerging AI regulations. See https://github.com/TaraRadvand74/llm-text-detection for code, data, and an online demo of the project.
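
To make the black-box setting concrete, the following sketch estimates a model's next-token distribution purely by repeated sampling. It is an illustration under stated assumptions, not the paper's procedure: the sample size comes from a standard two-sided Hoeffding bound for a single probability, which may differ from the authors' characterization, and `make_toy_sampler` is a hypothetical stand-in for querying a real model.

```python
import math
import random
from collections import Counter


def hoeffding_sample_size(epsilon, delta):
    """Samples needed so that one estimated probability is within `epsilon`
    of the truth with probability at least 1 - delta (two-sided Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_next_token_probs(sampler, n_samples):
    """Empirical next-token distribution from `n_samples` black-box draws."""
    counts = Counter(sampler() for _ in range(n_samples))
    return {token: count / n_samples for token, count in counts.items()}


def make_toy_sampler(seed=0):
    """Hypothetical black-box model: returns one next token per call.
    The vocabulary and weights are made up for illustration."""
    rng = random.Random(seed)
    vocab, weights = ["a", "b", "c"], [0.6, 0.3, 0.1]
    return lambda: rng.choices(vocab, weights=weights)[0]


if __name__ == "__main__":
    n = hoeffding_sample_size(epsilon=0.05, delta=0.01)
    estimates = estimate_next_token_probs(make_toy_sampler(), n)
    print(n, estimates)
```

The estimated probabilities could then stand in for exact white-box probabilities in a likelihood-based test. Note that the bound here covers one token probability at one position; covering a whole vocabulary or a whole text would need a union bound or a sharper argument.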