When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents

arXiv cs.CV · April 29, 2026


Key Points

  • The paper reports that GPT-Image-2 can generate or edit document images (e.g., receipt fields) in under a second with low cost, effectively blurring the visual line between authentic and AI-altered documents.
  • The authors release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 forgeries with pixel-precise masks in DocTamper-compatible format, along with benchmarks using human inspection and three computational detection approaches.
  • Human inspectors’ accuracy in distinguishing AI forgeries from real documents is 0.501 (near chance), and the computational judges perform only modestly above chance (TruFor 0.599, DocTamper 0.585, and GPT-Image-2 used as a zero-shot self-judge 0.532).
  • The “self-judge” strategy fails consistently across multiple prompt and ambiguity-handling policies, with AUC never exceeding 0.59, indicating GPT-Image-2 cannot reliably recognize its own inpainting/editing.
  • Calibration on same-domain traditional tampering shows the detectors work well on non-AI edits (TruFor AUC 0.962, DocTamper AUC 0.852), but performance drops by 0.27–0.36 when GPT-Image-2 inpainting is used, isolating a GPT-Image-2-specific detection gap; the dataset, pipeline, protocol, and calibration sets are released.
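To see why 0.501 accuracy on 365 pair-votes counts as "near chance", one can run an exact two-sided binomial test against the 2AFC chance rate of 0.5. The sketch below back-computes the correct-vote count from the reported accuracy (it is not the paper's raw data) and uses only the standard library:

```python
from math import comb

n = 365               # pair-votes reported in the paper
k = round(0.501 * n)  # ≈ 183 correct picks at the reported 0.501 accuracy

def pmf(i):
    """Binomial probability of exactly i correct guesses at chance (p = 0.5)."""
    return comb(n, i) * 0.5 ** n

# Exact two-sided test: sum the probabilities of every outcome
# at least as unlikely as the observed count k.
observed = pmf(k)
p_value = sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed)
print(f"accuracy = {k / n:.3f}, p-value = {p_value:.3f}")
```

Since 183/365 is essentially the most probable outcome under pure guessing, the p-value comes out at (or extremely close to) 1: the human result gives no evidence at all that inspectors can beat a coin flip.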

Abstract

OpenAI's GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site CanUSpotAI.com), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge, asked (to avoid the trivial "image is mostly real" reading) whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side by side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges sit only modestly above chance (TruFor 0.599, DocTamper 0.585, self-judge 0.532). The self-judge fails consistently, not by chance: across five prompt strategies and four policies for handling ambiguous responses, AUC never rises above 0.59. To rule out the possibility that the two forensic detectors are broken on our source domain rather than blind to AI inpainting, we calibrate each on a same-domain traditional-tampering set built for its training distribution: TruFor reaches AUC 0.962 on cross-camera splicing of our dataset, and DocTamper reaches 0.852 on cross-document OCR-token splicing with two-pass JPEG re-encoding. Both retain near-published performance on traditional tampering; switching to GPT-Image-2 inpainting drops AUC by 0.27–0.36 (0.962 → 0.599 for TruFor; 0.852 → 0.585 for DocTamper), isolating a detection gap specific to GPT-Image-2 inpainting. We release the dataset, pipeline, four-judge protocol, and calibration sets.
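The detector comparisons above are all ROC AUCs over per-image forgery scores. As a minimal illustration of what such a figure means (toy scores below are made up, not the paper's data; the authors presumably use a standard library implementation), AUC equals the probability that a randomly chosen forged image receives a higher tampering score than a randomly chosen authentic one:

```python
def roc_auc(scores_forged, scores_authentic):
    """AUC = P(forged image scores higher than authentic image),
    counting ties as half a win (the Mann-Whitney U formulation)."""
    wins = sum((f > a) + 0.5 * (f == a)
               for f in scores_forged for a in scores_authentic)
    return wins / (len(scores_forged) * len(scores_authentic))

# Hypothetical detector outputs: higher score = "looks tampered"
forged    = [0.9, 0.7, 0.8, 0.6]
authentic = [0.2, 0.4, 0.1, 0.65]
print(roc_auc(forged, authentic))  # 15 of 16 pairs ordered correctly → 0.9375
```

On this scale, 0.5 is coin-flipping and 1.0 is perfect separation, which is why TruFor's fall from 0.962 on traditional splicing to 0.599 on GPT-Image-2 inpainting amounts to losing almost all of its discriminative power.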