FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

arXiv cs.AI / 4/7/2026

Key Points

The paper introduces FeynmanBench, a new benchmark specifically designed to test multimodal LLMs on Feynman-diagram-based physics reasoning rather than only local information extraction.
The benchmark evaluates multistep capabilities including enforcing conservation laws and symmetries, determining graph topology, translating between diagrammatic and algebraic forms, and constructing scattering amplitudes under defined conventions and gauges.
An automated pipeline generates diverse Standard Model Feynman diagrams with verifiable topological annotations and corresponding amplitude results, enabling large-scale and reproducible evaluation.
The dataset covers electromagnetic, weak, and strong interactions, includes 100+ distinct diagram types, and provides 2000+ tasks.
Experiments show consistent failure modes in leading multimodal LLMs, such as unstable physical-constraint enforcement and incorrect global topological reasoning, underscoring the need for physics-grounded visual reasoning benchmarks.
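To make "enforcing conservation laws" concrete, here is a minimal sketch of one such check: electric charge conservation at a single vertex. It is an illustration only, not the paper's pipeline. The function name `conserves_charge` and the small particle table are assumptions introduced here; the charge values themselves are the standard ones, in units of e.

```python
from fractions import Fraction

# Electric charges in units of e. The table is a small illustrative subset,
# not a complete Standard Model particle list.
CHARGE = {
    "e-": Fraction(-1),
    "e+": Fraction(1),
    "photon": Fraction(0),
    "u": Fraction(2, 3),
    "d": Fraction(-1, 3),
    "W+": Fraction(1),
    "W-": Fraction(-1),
}


def conserves_charge(incoming, outgoing):
    """Return True if total electric charge flowing into a vertex
    equals the total charge flowing out (hypothetical helper)."""
    charge_in = sum(CHARGE[p] for p in incoming)
    charge_out = sum(CHARGE[p] for p in outgoing)
    return charge_in == charge_out


# QED vertex: an electron emits a photon and remains an electron.
print(conserves_charge(["e-"], ["e-", "photon"]))
# Charged-current weak vertex: an up quark becomes a down quark and a W+.
print(conserves_charge(["u"], ["d", "W+"]))
# Not allowed: an electron cannot turn into a positron by emitting a photon.
print(conserves_charge(["e-"], ["e+", "photon"]))
```

Exact fractions are used so that quark charges compare without floating-point error. A full validity check would also need to cover the other conserved quantities, such as lepton number and color, which this sketch omits.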

Abstract

Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI's capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompasses over 100 distinct types and includes more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.
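As an example of a global topological property of the kind the abstract mentions, the number of independent loops in a diagram follows from the graph relation L = I - V + C, where I is the number of internal lines, V the number of vertices, and C the number of connected components. The sketch below computes it with a small union-find. It is an assumption-laden illustration and not the benchmark's annotation code; the function name `loop_count` and the edge-list encoding are hypothetical.

```python
def loop_count(num_vertices, internal_edges):
    """Number of independent loops, L = I - V + C.

    num_vertices: vertices are labeled 0 .. num_vertices - 1.
    internal_edges: list of (a, b) pairs, one per internal propagator.
    External legs are not included, since they do not affect L.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components joined by each internal line.
    for a, b in internal_edges:
        parent[find(a)] = find(b)

    components = len({find(v) for v in range(num_vertices)})
    return len(internal_edges) - num_vertices + components


# Tree-level exchange: two vertices joined by one propagator.
print(loop_count(2, [(0, 1)]))
# One-loop self-energy: two vertices joined by two propagators.
print(loop_count(2, [(0, 1), (0, 1)]))
# Box diagram: four vertices connected in a cycle.
print(loop_count(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```

Because the result depends on the whole graph and not on any single vertex, it is a simple instance of the structural reasoning that local information extraction cannot supply.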
