Are Large Language Models Truly Smarter Than Humans?
arXiv cs.AI / 3/18/2026
Key Points
- The paper runs three complementary experiments to audit benchmark contamination in six frontier LLMs, revealing notable training-data leakage from public benchmarks.
- Across 513 MMLU questions, the lexical contamination pipeline finds an overall contamination rate of 13.8%, with rates as high as 66.7% in Philosophy and estimated performance gains of +0.030 to +0.054 accuracy points by category.
- Indirect-reference testing shows accuracy declines of about 7.0 percentage points on average, rising to 19.8 percentage points in Law and Ethics, indicating reliance on memorized or paraphrased content.
- Behavioral probes find memorization signals on 72.5% of questions, with DeepSeek-R1 showing a distinctive memorization pattern; all three experiments rank contamination as STEM > Professional > Social Sciences > Humanities.
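The first experiment above relies on a lexical contamination pipeline. The paper's exact pipeline is not described in this summary, but the general idea can be sketched as an n-gram overlap check: a benchmark question is flagged as contaminated if a large fraction of its word n-grams also appear in some training-corpus snippet. All function names and thresholds below are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of a lexical contamination check via word n-gram
# overlap. The 5-gram size and 0.5 threshold are arbitrary illustrative
# choices, not values from the paper.

def ngrams(text: str, n: int = 5) -> set:
    """Set of lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(question: str, snippet: str, n: int = 5) -> float:
    """Fraction of the question's n-grams that also occur in the snippet."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & ngrams(snippet, n)) / len(q)

def is_contaminated(question: str, corpus: list,
                    n: int = 5, threshold: float = 0.5) -> bool:
    """Flag the question if any corpus snippet exceeds the overlap threshold."""
    return any(overlap_ratio(question, s, n) >= threshold for s in corpus)

# Toy example: one snippet that verbatim-contains a benchmark question.
corpus = ["the mitochondria is the powerhouse of the cell and produces atp"]
q_leaked = "The mitochondria is the powerhouse of the cell"
q_clean = "Which organelle synthesizes proteins in eukaryotic cells?"
print(is_contaminated(q_leaked, corpus))  # True
print(is_contaminated(q_clean, corpus))   # False
```

Run over all 513 MMLU questions and grouped by category, such a check would yield per-category contamination rates like the 13.8% overall and 66.7% Philosophy figures reported above.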