When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents

arXiv cs.CV · April 29, 2026


Key Points

  • The paper reports that GPT-Image-2 can generate or edit document images (e.g., receipt fields) in under a second with low cost, effectively blurring the visual line between authentic and AI-altered documents.
  • The authors release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 forgeries with pixel-precise masks in DocTamper-compatible format, along with benchmarks using human inspection and three computational detection approaches.
  • Human inspectors’ accuracy in distinguishing AI forgeries from real documents is 0.501 (near chance), and the computational judges perform only modestly above chance (TruFor 0.599, DocTamper 0.585, and GPT-Image-2 used as a zero-shot self-judge 0.532).
  • The “self-judge” strategy fails consistently across multiple prompt and ambiguity-handling policies, with AUC never exceeding 0.59, indicating GPT-Image-2 cannot reliably recognize its own inpainting/editing.
  • Calibration on same-domain traditional tampering shows the detectors work well on non-AI edits (TruFor AUC 0.962, DocTamper AUC 0.852), but performance drops by 0.27–0.36 when GPT-Image-2 inpainting is used, isolating a GPT-Image-2-specific detection gap; the dataset, pipeline, protocol, and calibration sets are released.
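To see why 0.501 accuracy on 365 pair-votes counts as "near chance", one can run an exact two-sided binomial test against the 2AFC chance rate of 0.5. The sketch below back-computes the correct-vote count from the reported accuracy (it is not the paper's raw data) and uses only the standard library:

```python
from math import comb

n = 365               # pair-votes reported in the paper
k = round(0.501 * n)  # ≈ 183 correct picks at the reported 0.501 accuracy

def pmf(i):
    """Binomial probability of exactly i correct guesses at chance (p = 0.5)."""
    return comb(n, i) * 0.5 ** n

# Exact two-sided test: sum the probabilities of every outcome
# at least as unlikely as the observed count k.
observed = pmf(k)
p_value = sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed)
print(f"accuracy = {k / n:.3f}, p-value = {p_value:.3f}")
```

Since 183/365 is essentially the most probable outcome under pure guessing, the p-value comes out at (or extremely close to) 1: the human result gives no evidence at all that inspectors can beat a coin flip.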

Abstract

OpenAI's GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site CanUSpotAI.com), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge, asked (to avoid the trivial "image is mostly real" reading) whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side by side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges sit only modestly above chance (TruFor 0.599, DocTamper 0.585, self-judge 0.532). The self-judge fails consistently, not by chance: across five prompt strategies and four policies for handling ambiguous responses, AUC never rises above 0.59. To rule out the possibility that the two forensic detectors are broken on our source domain rather than blind to AI inpainting, we calibrate each on a same-domain traditional-tampering set built for its training distribution: TruFor reaches AUC 0.962 on cross-camera splicing of our dataset, and DocTamper reaches 0.852 on cross-document OCR-token splicing with two-pass JPEG re-encoding. Both retain near-published performance on traditional tampering; switching to GPT-Image-2 inpainting drops AUC by 0.27–0.36 (0.962 → 0.599 for TruFor; 0.852 → 0.585 for DocTamper), isolating a detection gap specific to GPT-Image-2 inpainting. We release the dataset, pipeline, four-judge protocol, and calibration sets.
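The detector comparisons above are all ROC AUCs over per-image forgery scores. As a minimal illustration of what such a figure means (toy scores below are made up, not the paper's data; the authors presumably use a standard library implementation), AUC equals the probability that a randomly chosen forged image receives a higher tampering score than a randomly chosen authentic one:

```python
def roc_auc(scores_forged, scores_authentic):
    """AUC = P(forged image scores higher than authentic image),
    counting ties as half a win (the Mann-Whitney U formulation)."""
    wins = sum((f > a) + 0.5 * (f == a)
               for f in scores_forged for a in scores_authentic)
    return wins / (len(scores_forged) * len(scores_authentic))

# Hypothetical detector outputs: higher score = "looks tampered"
forged    = [0.9, 0.7, 0.8, 0.6]
authentic = [0.2, 0.4, 0.1, 0.65]
print(roc_auc(forged, authentic))  # 15 of 16 pairs ordered correctly → 0.9375
```

On this scale, 0.5 is coin-flipping and 1.0 is perfect separation, which is why TruFor's fall from 0.962 on traditional splicing to 0.599 on GPT-Image-2 inpainting amounts to losing almost all of its discriminative power.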