GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics
arXiv cs.AI / 3/13/2026
Key Points
- GPT4o-Receipt is a dataset of 1,235 receipt images that pairs GPT-4o-generated receipts with authentic ones. It comes with an evaluation of five state-of-the-art multimodal LLMs and a 30-annotator perceptual study.
- The study finds that humans are better at perceiving visual AI artifacts but worse at detecting AI-generated documents overall: annotators show the largest visual discrimination gap, yet their binary detection F1 is lower than that of Claude Sonnet 4 and Gemini 2.5 Flash.
- The key forensic signal in AI-generated receipts is the arithmetic error (e.g., an incorrect subtotal), which is invisible to visual inspection but can be verified by LLMs in milliseconds.
- The results reveal dramatic performance disparities and calibration differences among models, which makes simple accuracy metrics unreliable for detector selection. The authors release GPT4o-Receipt and all results publicly to support future AI document-forensics research.
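
The arithmetic signal described in the key points can be illustrated with a small consistency check. This is a minimal sketch, not the authors' method: it assumes the receipt has already been parsed (for example by OCR or an LLM) into structured fields, and the function name `check_receipt`, the tuple layout, and the one-cent tolerance are all hypothetical choices made for illustration.

```python
from decimal import Decimal


def check_receipt(items, subtotal, tax, total, tolerance=Decimal("0.01")):
    """Return labels for arithmetic inconsistencies in a parsed receipt.

    Hypothetical input layout (an assumption, not from the paper):
      items: list of (quantity, unit_price, line_total) tuples
      subtotal, tax, total: amounts as strings, e.g. "11.25"
    """
    problems = []
    running = Decimal("0")

    for index, (qty, unit_price, line_total) in enumerate(items):
        # Each line total should equal quantity times unit price.
        expected = Decimal(qty) * Decimal(unit_price)
        if abs(expected - Decimal(line_total)) > tolerance:
            problems.append(f"line {index}")
        running += Decimal(line_total)

    # The printed subtotal should equal the sum of the line totals.
    if abs(running - Decimal(subtotal)) > tolerance:
        problems.append("subtotal")

    # The printed total should equal subtotal plus tax.
    if abs(Decimal(subtotal) + Decimal(tax) - Decimal(total)) > tolerance:
        problems.append("total")

    return problems


items = [(2, "3.50", "7.00"), (1, "4.25", "4.25")]
consistent = check_receipt(items, "11.25", "0.90", "12.15")
suspicious = check_receipt(items, "12.25", "0.90", "13.15")
print(consistent)
print(suspicious)
```

`Decimal` is used instead of floats so that currency amounts compare exactly. A non-empty result only indicates an arithmetic mismatch; on its own it does not establish that a receipt is AI-generated, since parsing errors or discounts not modeled here could also produce one.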




