ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
arXiv cs.LG / 4/23/2026
Key Points
- ThermoQA is a new benchmark with 293 open-ended engineering thermodynamics problems split into three tiers: property lookups, component analysis, and full cycle analysis.
- Ground truth answers are generated programmatically using CoolProp 7.2.0 and span working fluids including water, R-134a, and variable-cp air.
- Six frontier LLMs were evaluated with three independent runs each, producing a composite leaderboard led by Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro.
- Performance degrades across tiers, which suggests that memorizing property values is not equivalent to genuine thermodynamic reasoning.
- The dataset and code are released as open source on Hugging Face, enabling reproducible evaluation of thermodynamic reasoning consistency.
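
To make the evaluation setup above more concrete, here is a minimal sketch of how a programmatic reference value and a multi-run score could be computed. It is not the benchmark's actual code. `PropsSI` is CoolProp's real high-level property function, but the helper names (`coolprop_props`, `reference_enthalpy_kj_per_kg`, `is_correct`, `mean_accuracy`), the 2% relative tolerance, and the plain averaging over runs are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Same argument order as CoolProp's PropsSI(output, name1, value1, name2, value2, fluid).
PropsFn = Callable[[str, str, float, str, float, str], float]


def coolprop_props() -> PropsFn:
    """Return CoolProp's PropsSI, imported lazily so the scoring code runs without it."""
    from CoolProp.CoolProp import PropsSI  # needs the third-party CoolProp package
    return PropsSI


def reference_enthalpy_kj_per_kg(props: PropsFn, temp_k: float,
                                 pressure_pa: float, fluid: str) -> float:
    """Property-lookup style reference: specific enthalpy at (T, P), J/kg -> kJ/kg."""
    return props("H", "T", temp_k, "P", pressure_pa, fluid) / 1000.0


def is_correct(answer: float, reference: float, rel_tol: float = 0.02) -> bool:
    """Hypothetical pass rule: relative error within rel_tol (2% is an assumed value)."""
    return abs(answer - reference) <= rel_tol * abs(reference)


def mean_accuracy(runs: Sequence[Sequence[float]],
                  references: Sequence[float]) -> float:
    """Average the per-run accuracy over independent runs (one answer list per run)."""
    per_run = [
        sum(is_correct(a, r) for a, r in zip(run, references)) / len(references)
        for run in runs
    ]
    return sum(per_run) / len(per_run)


if __name__ == "__main__":
    # Example usage; requires CoolProp to be installed.
    props = coolprop_props()
    ref = reference_enthalpy_kj_per_kg(props, 473.15, 101325.0, "Water")
    # Three made-up answers standing in for three independent model runs.
    print(mean_accuracy([[ref], [ref * 1.01], [ref * 1.10]], [ref]))
```

Passing the property function in as a parameter keeps the scoring logic independent of CoolProp, so it can be exercised with a stub when the library is not available.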