THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics
arXiv cs.CV / 3/27/2026
Key Points
- THEMIS is a new multi-task benchmark for evaluating multimodal large language models (MLLMs) on visual reasoning tasks related to scientific paper fraud forensics.
- The benchmark includes 4,000+ questions across seven real-world and synthetic multimodal scenarios, aiming to match the complexity seen in authentic retracted-paper cases.
- It introduces broader and more granular coverage of fraud by including five fraud types and 16 fine-grained manipulation operations, often stacked within a single sample.
- THEMIS evaluates models across five core visual fraud-reasoning capabilities mapped from fraud types, enabling diagnosis of strengths and weaknesses per capability.
- Results across 16 leading MLLMs show low overall performance (the best model, GPT-5, scores 56.15%), indicating that the benchmark remains a stringent, challenging test of fraud reasoning.