ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

arXiv cs.LG / April 23, 2026


Key Points

  • ThermoQA is a new benchmark with 293 open-ended engineering thermodynamics problems split into three tiers: property lookups, component analysis, and full cycle analysis.
  • Ground truth answers are generated programmatically using CoolProp 7.2.0 and span working fluids including water, R-134a, and variable-cp air (a minimal lookup sketch follows this list).
  • Six frontier LLMs were evaluated with three independent runs each, producing a composite leaderboard led by Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro.
  • The results show cross-tier degradation, ranging from 2.8 percentage points for the strongest model to 32.5 for the weakest, suggesting that memorizing property data is not equivalent to genuine thermodynamic reasoning.
  • The dataset and code are released as open source on Hugging Face, enabling reproducible evaluation of thermodynamic reasoning consistency.
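
The following is a minimal sketch, not the ThermoQA generation code, of how a Tier-1 property lookup can be answered programmatically with CoolProp's PropsSI interface; the helper name and the example states are illustrative assumptions.

```python
# Minimal sketch: programmatic ground truth for a property-lookup question.
# NOTE: illustrative only -- the helper name and the chosen states are
# assumptions, not taken from the ThermoQA dataset or code.
from CoolProp.CoolProp import PropsSI

def sat_enthalpy_kj_per_kg(fluid: str, p_kpa: float, quality: float) -> float:
    """Specific enthalpy [kJ/kg] at saturation pressure p_kpa and vapor quality."""
    # PropsSI is SI-only: pressure in Pa, enthalpy returned in J/kg.
    return PropsSI("H", "P", p_kpa * 1e3, "Q", quality, fluid) / 1e3

# Saturated-vapor enthalpy of water at 100 kPa (~2675 kJ/kg in steam tables)
print(sat_enthalpy_kj_per_kg("Water", 100.0, 1.0))

# Saturated-liquid enthalpy of R-134a at 300 kPa (the absolute value depends
# on CoolProp's default reference state for R134a)
print(sat_enthalpy_kj_per_kg("R134a", 300.0, 0.0))

# "Variable-cp" air: a real-fluid model rather than a constant-cp ideal gas,
# so enthalpy at 500 K and 1 atm reflects temperature-dependent cp.
print(PropsSI("H", "T", 500.0, "P", 101325.0, "Air") / 1e3, "kJ/kg")
```

PropsSI works entirely in SI units (Pa in, J/kg out), so a thin wrapper like this is a common way to stay in the kPa/kJ conventions of thermodynamics textbooks.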

Abstract

We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 questions), component analysis (101 questions), and full cycle analysis (82 questions). Ground truth is computed programmatically with CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 percentage points (Opus) to 32.5 (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators, with performance spreads of 40–60 percentage points. Multi-run standard deviation ranges from ±0.1% to ±2.5%, quantifying reasoning consistency as a distinct evaluation axis. The dataset and code are open source at https://huggingface.co/datasets/olivenet/thermoqa.
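
As one plausible reading of how the two summary statistics relate, a sketch like the following (with hypothetical model names and scores, not the paper's evaluation code) computes the composite mean ± σ over independent runs and the Tier-1-to-Tier-3 degradation in percentage points:

```python
# Sketch of the two summary metrics the abstract reports. All model names
# and scores below are hypothetical placeholders, not ThermoQA results.
from statistics import mean, stdev

# Composite accuracy (%) over three independent runs.
composite_runs = {"model-a": [94.0, 94.2, 94.1], "model-b": [78.3, 80.9, 75.6]}
# Mean per-tier accuracy (%): (tier 1, tier 2, tier 3).
tier_means = {"model-a": (97.0, 95.5, 94.2), "model-b": (93.0, 82.4, 60.5)}

for model in composite_runs:
    mu, sigma = mean(composite_runs[model]), stdev(composite_runs[model])
    t1, _, t3 = tier_means[model]
    print(f"{model}: composite {mu:.1f}% +/- {sigma:.1f}, "
          f"cross-tier degradation {t1 - t3:.1f} pp")
```

Reporting σ alongside the mean is what makes consistency a separate evaluation axis: two models with the same composite accuracy can differ sharply in run-to-run stability.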