BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models
arXiv cs.AI · March 17, 2026
Key Points
- BrainBench introduces a benchmark of 100 brainteaser questions across 20 categories designed to probe specific commonsense reasoning failure modes in large language models.
- The study evaluates eight frontier models—four Claude variants and four GPT variants—under a zero-shot protocol with 10 independent runs per question, with accuracy ranging from 80.3% for Claude Opus 4.6 with extended thinking down to 39.7% for GPT-4o.
- Even the top models show a 6–16 percentage-point gap between accuracy and run-to-run consistency, indicating that their reasoning on these questions is partly stochastic (a sketch of how such metrics can be computed follows this list).
- Cross-lingual evaluation in Chinese shows only modest 2–8 percentage-point degradations, suggesting the failures stem from underlying reasoning deficits rather than language-specific artifacts.
- BrainBench provides a fine-grained diagnostic tool to locate where LLMs rely on surface heuristics instead of genuine commonsense reasoning.
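To make the accuracy-versus-consistency gap concrete, here is a minimal Python sketch of a repeated-sampling evaluation loop. The paper does not publish its harness, so everything here is an illustrative assumption: the `ask` callable standing in for a zero-shot model call, the item schema, and the definition of consistency as agreement with the modal answer across runs.

```python
import random
import statistics
from collections import Counter
from typing import Callable

# Assumption: 10 independent zero-shot samples per question, per the paper's
# stated protocol. The consistency metric below (modal-answer agreement) is
# one plausible reading of "consistency", not the paper's published code.
N_RUNS = 10


def evaluate(ask: Callable[[str], str], benchmark: list[dict]) -> dict:
    """Return mean accuracy, mean consistency, and their gap for one model.

    `ask` wraps a single zero-shot model call; each benchmark item is
    assumed to look like {"question": str, "answer": str}.
    """
    per_q_acc, per_q_cons = [], []
    for item in benchmark:
        # Sample the model N_RUNS times on the bare question (zero-shot).
        answers = [ask(item["question"]) for _ in range(N_RUNS)]
        per_q_acc.append(sum(a == item["answer"] for a in answers) / N_RUNS)
        # Consistency: fraction of runs agreeing with the modal answer,
        # whether or not that answer is correct.
        per_q_cons.append(Counter(answers).most_common(1)[0][1] / N_RUNS)
    acc = statistics.mean(per_q_acc)
    cons = statistics.mean(per_q_cons)
    return {"accuracy": acc, "consistency": cons, "gap": cons - acc}


if __name__ == "__main__":
    # Toy stand-in model: answers correctly 70% of the time, otherwise
    # picks a random distractor. Replace with a real API call in practice.
    toy = [{"question": f"q{i}", "answer": "A"} for i in range(100)]
    noisy = lambda q: "A" if random.random() < 0.7 else random.choice("BCD")
    print(evaluate(noisy, toy))
```

Under this definition, per-question consistency is always at least per-question accuracy (the modal answer is at least as frequent as the correct one), so a large positive gap means models often converge on the same wrong answer rather than guessing at random.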