Are Large Language Models Truly Smarter Than Humans?
arXiv cs.AI / 3/18/2026
Key Points
- The paper conducts three complementary experiments to audit contamination in six frontier LLMs, revealing notable training-data leakage in public benchmarks.
- Across 513 MMLU questions, the lexical contamination pipeline finds an overall contamination rate of 13.8%, reaching 66.7% in Philosophy, with estimated performance gains of +0.030 to +0.054 accuracy points depending on category.
- Indirect-reference testing shows accuracy declines of about 7.0 percentage points on average, rising to 19.8 percentage points in Law and Ethics, indicating reliance on memorized or paraphrased content.
- Behavioral probes reveal 72.5% of questions trigger memorization signals, with DeepSeek-R1 displaying a distinctive memorization pattern, and all experiments ranking contamination as STEM > Professional > Social Sciences > Humanities.
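The paper's exact lexical contamination pipeline is not described in this digest, but the core idea of such checks can be sketched as verbatim n-gram overlap between a benchmark question and training-corpus text. The function names, the n-gram size, and the 0.5 threshold below are illustrative assumptions, not the paper's method:

```python
# Illustrative lexical-contamination check (assumed approach, not the
# paper's actual pipeline): flag a benchmark question as contaminated
# when a large fraction of its word n-grams appear verbatim in a
# sample of training-corpus text.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(question: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the question's n-grams found verbatim in the corpus."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & ngrams(corpus_text, n)) / len(q)

def is_contaminated(question: str, corpus_text: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Binary decision: hypothetical threshold on the overlap score."""
    return contamination_score(question, corpus_text, n) >= threshold
```

In practice such pipelines run against indexed corpus shards rather than raw strings, and the indirect-reference and behavioral probes in the paper exist precisely because purely lexical checks miss paraphrased leakage.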