ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

arXiv cs.LG / April 23, 2026


Key Points

  • ThermoQA is a new benchmark with 293 open-ended engineering thermodynamics problems split into three tiers: property lookups, component analysis, and full cycle analysis.
  • Ground truth answers are generated programmatically using CoolProp 7.2.0 and span working fluids including water, R-134a, and variable-cp air (a minimal lookup sketch follows this list).
  • Six frontier LLMs were evaluated with three independent runs each, producing a composite leaderboard led by Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro.
  • The results show cross-tier degradation, ranging from 2.8 percentage points for the strongest model to 32.5 for the weakest, suggesting that memorizing property data is not equivalent to genuine thermodynamic reasoning.
  • The dataset and code are released as open source on Hugging Face, enabling reproducible evaluation of thermodynamic reasoning consistency.
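
The following is a minimal sketch, not the ThermoQA generation code, of how a Tier-1 property lookup can be answered programmatically with CoolProp's PropsSI interface; the helper name and the example states are illustrative assumptions.

```python
# Minimal sketch: programmatic ground truth for a property-lookup question.
# NOTE: illustrative only -- the helper name and the chosen states are
# assumptions, not taken from the ThermoQA dataset or code.
from CoolProp.CoolProp import PropsSI

def sat_enthalpy_kj_per_kg(fluid: str, p_kpa: float, quality: float) -> float:
    """Specific enthalpy [kJ/kg] at saturation pressure p_kpa and vapor quality."""
    # PropsSI is SI-only: pressure in Pa, enthalpy returned in J/kg.
    return PropsSI("H", "P", p_kpa * 1e3, "Q", quality, fluid) / 1e3

# Saturated-vapor enthalpy of water at 100 kPa (~2675 kJ/kg in steam tables)
print(sat_enthalpy_kj_per_kg("Water", 100.0, 1.0))

# Saturated-liquid enthalpy of R-134a at 300 kPa (the absolute value depends
# on CoolProp's default reference state for R134a)
print(sat_enthalpy_kj_per_kg("R134a", 300.0, 0.0))

# "Variable-cp" air: a real-fluid model rather than a constant-cp ideal gas,
# so enthalpy at 500 K and 1 atm reflects temperature-dependent cp.
print(PropsSI("H", "T", 500.0, "P", 101325.0, "Air") / 1e3, "kJ/kg")
```

PropsSI works entirely in SI units (Pa in, J/kg out), so a thin wrapper like this is a common way to stay in the kPa/kJ conventions of thermodynamics textbooks.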

Abstract

We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 questions), component analysis (101 questions), and full cycle analysis (82 questions). Ground truth is computed programmatically with CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 percentage points (Opus) to 32.5 (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators, with performance spreads of 40–60 percentage points. Multi-run standard deviation ranges from ±0.1% to ±2.5%, quantifying reasoning consistency as a distinct evaluation axis. The dataset and code are open source at https://huggingface.co/datasets/olivenet/thermoqa.
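
As one plausible reading of how the two summary statistics relate, a sketch like the following (with hypothetical model names and scores, not the paper's evaluation code) computes the composite mean ± σ over independent runs and the Tier-1-to-Tier-3 degradation in percentage points:

```python
# Sketch of the two summary metrics the abstract reports. All model names
# and scores below are hypothetical placeholders, not ThermoQA results.
from statistics import mean, stdev

# Composite accuracy (%) over three independent runs.
composite_runs = {"model-a": [94.0, 94.2, 94.1], "model-b": [78.3, 80.9, 75.6]}
# Mean per-tier accuracy (%): (tier 1, tier 2, tier 3).
tier_means = {"model-a": (97.0, 95.5, 94.2), "model-b": (93.0, 82.4, 60.5)}

for model in composite_runs:
    mu, sigma = mean(composite_runs[model]), stdev(composite_runs[model])
    t1, _, t3 = tier_means[model]
    print(f"{model}: composite {mu:.1f}% +/- {sigma:.1f}, "
          f"cross-tier degradation {t1 - t3:.1f} pp")
```

Reporting σ alongside the mean is what makes consistency a separate evaluation axis: two models with the same composite accuracy can differ sharply in run-to-run stability.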