Are Large Language Models Truly Smarter Than Humans?
arXiv cs.AI / 3/18/2026
Key Points
- The paper conducts three complementary experiments to audit contamination in six frontier LLMs, revealing notable training-data leakage in public benchmarks.
- Across 513 MMLU questions, the lexical contamination pipeline finds an overall contamination rate of 13.8%, reaching 66.7% in Philosophy, with estimated performance gains of +0.030 to +0.054 accuracy points depending on category.
- Indirect-reference testing shows accuracy declines of about 7.0 percentage points on average, rising to 19.8 percentage points in Law and Ethics, indicating reliance on memorized or paraphrased content.
- Behavioral probes reveal 72.5% of questions trigger memorization signals, with DeepSeek-R1 displaying a distinctive memorization pattern, and all experiments ranking contamination as STEM > Professional > Social Sciences > Humanities.
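The paper's exact lexical contamination pipeline is not described in this digest, but the core idea of such checks can be sketched as verbatim n-gram overlap between a benchmark question and training-corpus text. The function names, the n-gram size, and the 0.5 threshold below are illustrative assumptions, not the paper's method:

```python
# Illustrative lexical-contamination check (assumed approach, not the
# paper's actual pipeline): flag a benchmark question as contaminated
# when a large fraction of its word n-grams appear verbatim in a
# sample of training-corpus text.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(question: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the question's n-grams found verbatim in the corpus."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & ngrams(corpus_text, n)) / len(q)

def is_contaminated(question: str, corpus_text: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Binary decision: hypothetical threshold on the overlap score."""
    return contamination_score(question, corpus_text, n) >= threshold
```

In practice such pipelines run against indexed corpus shards rather than raw strings, and the indirect-reference and behavioral probes in the paper exist precisely because purely lexical checks miss paraphrased leakage.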