AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

arXiv cs.CV / 5/1/2026


Key Points

  • The paper introduces AEGIS, a holistic benchmark to evaluate forensic analysis of AI-generated academic images across seven academic categories and 39 fine-grained subtypes.
  • AEGIS goes beyond prior benchmarks by incorporating domain-specific complexity: even GPT-5.1 reaches only 48.80% overall performance, and expert models achieve limited localization accuracy (IoU 30.09%).
  • It simulates four common academic forgery strategies across 25 generative models; for 11 of those models, average forensic accuracy falls below 50%, indicating that forensics lags behind generation capabilities.
  • The benchmark jointly evaluates three dimensions (detection, reasoning, and localization) and reveals complementary strengths between model families: multimodal large language models (MLLMs) reach 84.74% accuracy in textual artifact recognition, while expert detectors peak at 79.54% accuracy in binary authenticity detection.
  • By testing 25 leading MLLMs, nine expert models, and a unified multimodal understanding/generation model, AEGIS is positioned as a diagnostic testbed exposing fundamental limitations in current academic image forensics.

Abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
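
The abstract reports localization quality as IoU and detection quality as binary accuracy. As a minimal sketch of how such metrics are commonly computed (not the authors' evaluation code), the Python below assumes that a localization output is a 2D binary mask with 1 marking forged pixels, and that detection labels are 1 for AI-generated and 0 for authentic. The names `mask_iou` and `binary_accuracy` are hypothetical and used only for illustration.

```python
from typing import Sequence

# Assumption: a mask is a 2D grid of 0/1 values, where 1 marks a forged pixel.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two same-shaped binary masks."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1  # pixel flagged in both masks
            if p or t:
                union += 1  # pixel flagged in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def binary_accuracy(pred_labels: Sequence[int], true_labels: Sequence[int]) -> float:
    """Fraction of images whose predicted label matches the true label."""
    if len(pred_labels) != len(true_labels) or len(true_labels) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
    return correct / len(true_labels)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    predicted_mask = [[1, 1, 0], [0, 1, 0]]
    ground_truth_mask = [[1, 0, 0], [0, 1, 1]]
    print(mask_iou(predicted_mask, ground_truth_mask))     # 2 shared / 4 flagged = 0.5
    print(binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))     # 3 of 4 correct = 0.75
```

Under this definition, an IoU near 30% means the predicted forged region and the annotated region share less than a third of their combined area. Whether the benchmark averages IoU per image or pools pixels across the dataset is not stated in the abstract.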