AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

arXiv cs.CV / 5/1/2026

Key Points

The paper introduces AEGIS, a holistic benchmark to evaluate forensic analysis of AI-generated academic images across seven academic categories and 39 fine-grained subtypes.
AEGIS goes beyond prior benchmarks by incorporating domain-specific complexity: even GPT-5.1 reaches only 48.80% overall performance, and expert models achieve limited localization accuracy (IoU 30.09%).
It simulates four common academic forgery strategies across 25 generative models; for 11 of those models, average forensic accuracy falls below 50%, indicating that forensics lags behind generation capabilities.
The benchmark evaluates forensics in multiple dimensions—detection, reasoning, and localization—showing complementary strengths between model families (e.g., MLLMs at 84.74% for textual artifact recognition and expert detectors at 79.54% for binary authenticity detection).
By testing 25 leading MLLMs, nine expert models, and a unified multimodal understanding/generation model, AEGIS is positioned as a diagnostic testbed exposing fundamental limitations in current academic image forensics.

Abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
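
The abstract reports localization quality as IoU and detection quality as binary accuracy. As a minimal sketch of what those two metrics measure, the Python below assumes the standard pixel-wise intersection-over-union on binary forgery masks and plain label agreement for authenticity detection. The function names `mask_iou` and `binary_accuracy`, the mask encoding, and the toy inputs are illustrative assumptions, not AEGIS's actual evaluation code, which this summary does not describe.

```python
from typing import Sequence

# Assumed encoding: a 2-D binary mask where 1 marks a forged pixel and 0 an authentic one.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Pixel-wise intersection over union of two same-shaped binary masks."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


def binary_accuracy(pred_labels: Sequence[int], true_labels: Sequence[int]) -> float:
    """Fraction of images whose real/fake label is predicted correctly."""
    if len(pred_labels) != len(true_labels) or len(true_labels) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
    return correct / len(true_labels)


if __name__ == "__main__":
    # Toy masks invented for illustration only; they are not benchmark data.
    pred = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
    target = [[0, 0, 1], [0, 1, 1], [0, 1, 0]]
    print(mask_iou(pred, target))  # 3 shared pixels / 5 pixels in the union = 0.6
    print(binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct = 0.75
```

Under this definition, a detector can classify an image as forged correctly and still score a low IoU if its predicted region only partly overlaps the tampered area, which is one way detection and localization results can diverge.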
