How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

arXiv cs.CL / 4/7/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a stage-wise, problem-oriented evaluation framework to test LLM end-to-end mathematical modeling ability using expert-verified criteria across workflow stages.
  • It validates the framework by showing stronger agreement between its automatic scores and independent human expert judgments than existing evaluation schemes on contest problems.
  • Results reveal a persistent “comprehension–execution gap”: LLMs do well at early stages (problem identification and formulation) but struggle in execution stages such as solving, code implementation, and result analysis.
  • The study finds that simply scaling up model size does not eliminate these gaps, and attributes the failures to insufficient specification, missing verification, and lack of validation, with errors propagating uncorrected across stages.
  • The authors argue that closing the gap will require methods beyond scaling, offering guidance for deploying LLMs on complex real-world problem-solving workflows.
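To make the stage-wise idea concrete, here is a minimal sketch of how per-stage criterion scoring and a "comprehension–execution gap" metric could be computed. The stage names follow the paper's workflow, but the criteria, weights, aggregation rule, and gap definition are illustrative assumptions, not the authors' actual scoring method.

```python
# Hypothetical sketch of stage-wise, criterion-based scoring.
# Stage names follow the paper's workflow; everything else
# (criteria, weighting, gap definition) is an assumption.

STAGES = [
    "problem_identification",
    "problem_formulation",
    "model_solving",
    "code_implementation",
    "result_analysis",
]

def stage_score(criteria_results):
    """Fraction of expert-verified criteria satisfied at one stage."""
    if not criteria_results:
        return 0.0
    return sum(criteria_results) / len(criteria_results)

def overall_score(per_stage, weights=None):
    """Weighted mean of per-stage scores (uniform weights by default)."""
    if weights is None:
        weights = {s: 1.0 for s in per_stage}
    total = sum(weights[s] for s in per_stage)
    return sum(per_stage[s] * weights[s] for s in per_stage) / total

def comprehension_execution_gap(per_stage):
    """Mean early-stage score minus mean execution-stage score."""
    early = [per_stage[s] for s in STAGES[:2]]
    execution = [per_stage[s] for s in STAGES[2:]]
    return sum(early) / len(early) - sum(execution) / len(execution)

# Example: a model that comprehends well but executes poorly,
# mirroring the gap the paper reports (values are made up).
scores = {
    "problem_identification": stage_score([True, True, True]),
    "problem_formulation":    stage_score([True, True, False]),
    "model_solving":          stage_score([True, False, False]),
    "code_implementation":    stage_score([False, False, True]),
    "result_analysis":        stage_score([False, False, False]),
}
print(round(overall_score(scores), 3))
print(round(comprehension_execution_gap(scores), 3))
```

A positive gap under this toy metric corresponds to the pattern the paper describes: strong early-stage comprehension paired with weak execution.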

Abstract

Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.