THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

arXiv cs.CV / 3/27/2026


Key Points

  • THEMIS is a new multi-task benchmark for evaluating multimodal large language models (MLLMs) on visual reasoning tasks related to scientific paper fraud forensics.
  • The benchmark includes 4,000+ questions across seven real-world and synthetic multimodal scenarios, aiming to match the complexity seen in authentic retracted-paper cases.
  • It introduces broader and more granular coverage of fraud by including five fraud types and 16 fine-grained manipulation operations, often stacked within a single sample.
  • THEMIS evaluates models across five core visual fraud-reasoning capabilities mapped from fraud types, enabling diagnosis of strengths and weaknesses per capability.
  • Results across 16 leading MLLMs show low overall performance (best model GPT-5 at 56.15%), indicating the benchmark is a stringent and challenging test for fraud reasoning.
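The capability-level diagnosis described above relies on a mapping from fraud types to reasoning capabilities. The following is a minimal illustrative sketch of how such per-capability scoring could be aggregated; the fraud-type and capability names, and the mapping itself, are hypothetical placeholders and not taken from the paper.

```python
# Hypothetical sketch of THEMIS-style capability scoring.
# FRAUD_TO_CAPABILITY names are illustrative assumptions, not the paper's taxonomy.
from collections import defaultdict

# Assumed mapping: each fraud type probes one core reasoning capability.
FRAUD_TO_CAPABILITY = {
    "duplication": "duplicate detection",
    "splicing": "tamper localization",
    "retouching": "low-level artifact analysis",
    "fabrication": "cross-figure consistency",
    "mislabeling": "text-figure alignment",
}

def capability_scores(results):
    """results: iterable of (fraud_type, is_correct) pairs, one per question.

    Returns per-capability accuracy, letting strengths and weaknesses
    be read off capability by capability rather than as one overall score.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for fraud_type, is_correct in results:
        cap = FRAUD_TO_CAPABILITY[fraud_type]
        total[cap] += 1
        correct[cap] += int(is_correct)
    return {cap: correct[cap] / total[cap] for cap in total}

scores = capability_scores([
    ("duplication", True),
    ("duplication", False),
    ("splicing", True),
])
# e.g. scores -> {"duplicate detection": 0.5, "tamper localization": 1.0}
```

Grouping by capability rather than by raw question accuracy is what lets a benchmark report that a model is, say, strong at localization but weak at cross-figure consistency.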

Abstract

We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With complex-texture images making up 60.47% of the benchmark, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, and the diversity and difficulty of these manipulations demand a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.