TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering

arXiv cs.AI / 4/21/2026


Key Points

  • The paper argues that using LLMs as reasoning assistants in safety-critical aerospace work requires evaluation beyond generic math/physics benchmarks, because physically invalid yet numerically plausible answers can be more dangerous than refusals.
  • It introduces TPS-CalcBench, a diagnostic benchmark focused on closed-form analytical calculations for hypersonic thermal protection system (TPS) engineering, based on how experienced engineers solve problems without simulations.
  • The framework uses dual-track scoring to assess both result accuracy and reasoning quality with an 8-dimension rubric, including a calibrated judge with human audits to detect “right answer, wrong reasoning” and similar failure modes.
  • TPS-CalcBench includes a large, curated dataset (420 high-confidence items and 810 noise-controlled items) plus noise-sensitivity analysis to understand how data quality affects model rankings.
  • Experiments across 13 models show wide KPI variation and identify recurring defects (e.g., hidden formula selection issues), while three intervention approaches (DFA-TPS fine-tuning, RAG-EQ grounding, and process-aware prompting) improve performance within the diagnose-evaluate-intervene pipeline.
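The dual-track idea above — score the number and the reasoning separately, and flag mismatches — can be sketched in a few lines. The rubric dimension names, weights, tolerance, and flag threshold below are illustrative assumptions, not the paper's actual rubric or judge:

```python
# Sketch of dual-track scoring: a result-accuracy check (relative tolerance
# on the numeric answer) combined with a rubric-based reasoning score
# averaged over 8 dimensions. All names and weights are hypothetical.

def result_score(predicted: float, reference: float, rel_tol: float = 0.02) -> float:
    """1.0 if the numeric answer is within rel_tol of the reference, else 0.0."""
    return 1.0 if abs(predicted - reference) <= rel_tol * abs(reference) else 0.0

RUBRIC_DIMENSIONS = [  # illustrative 8-dimension rubric
    "formula_selection", "assumption_validity", "unit_consistency",
    "variable_substitution", "intermediate_steps", "physical_plausibility",
    "boundary_conditions", "final_interpretation",
]

def reasoning_score(judge_marks: dict[str, float]) -> float:
    """Mean of per-dimension judge marks, each in [0, 1]."""
    return sum(judge_marks[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)

def dual_track(predicted, reference, judge_marks, w_result=0.5):
    """Return a combined score and a 'right answer, wrong reasoning' flag."""
    r = result_score(predicted, reference)
    q = reasoning_score(judge_marks)
    # Accurate result but weak process is exactly the failure mode the
    # benchmark is designed to surface.
    flag = r == 1.0 and q < 0.5
    return w_result * r + (1 - w_result) * q, flag
```

A response that lands on the right number with sloppy reasoning would score well on a final-answer-only benchmark but gets flagged here, which is the point of auditing the process track separately.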

Abstract

Deploying LLMs as reasoning assistants in safety-critical aerospace engineering requires stricter evaluation criteria than general scientific benchmarks. In hypersonic thermal protection system (TPS) design, inaccurate stagnation-point heat flux or boundary-layer calculations may cause catastrophic design-margin violations, and a model that produces numerically reasonable but physically invalid answers is more dangerous than one that declines to respond. Current scientific benchmarks test only abstract math and basic physics, score only final answers, ignore engineering reasoning processes, and cannot detect such critical failures. We propose TPS-CalcBench, the first diagnostic benchmark for the closed-form analytical calculations in hypersonic aerodynamics and high-temperature gas dynamics that experienced TPS engineers perform without simulations. Our contributions include: a domain-oriented task taxonomy with 4 difficulty levels and 8 categories drawn from Anderson's textbook; dual-track evaluation measuring result accuracy and reasoning quality via an 8-dimension rubric and a calibrated judge with human audit to identify "right answer, wrong reasoning" issues; a human-AI data pipeline producing 420 high-confidence core items and 810 noise-controlled pre-gating items from 4,560 raw items; noise-sensitivity analysis measuring the impact of data quality on model rankings; and three diagnostic intervention methods: DFA-TPS fine-tuning, RAG-EQ retrieval grounding, and PA-CoT process-aware prompting. Tests on 13 models from 7 groups show wide performance differences (KPI 12.6-87.9), hidden formula-selection defects, data-driven rank changes, and effective intervention improvements, establishing a complete diagnose-evaluate-intervene framework for assessing LLM deployment in safety-critical engineering.
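To make the "closed-form analytical calculation" concrete: the classic hand-calculation for stagnation-point convective heat flux is the Sutton-Graves correlation, q_s = k·√(ρ/R_n)·V³. The sketch below is not taken from the paper; the flight condition is invented, the constant is the commonly quoted Earth-atmosphere value, and sources differ on its units convention, so treat the result as an order-of-magnitude illustration only:

```python
import math

# Hedged sketch: Sutton-Graves stagnation-point convective heat-flux estimate,
# the kind of no-simulation engineering calculation TPS-CalcBench tests.
# K_EARTH is the widely quoted Earth-atmosphere constant; with SI inputs
# (rho in kg/m^3, V in m/s, R_n in m) the dimensional analysis gives q in W/m^2.

K_EARTH = 1.7415e-4  # Sutton-Graves constant for Earth atmosphere (assumed)

def sutton_graves_heat_flux(rho: float, velocity: float, nose_radius: float) -> float:
    """Stagnation-point convective heat flux [W/m^2].

    rho: freestream density [kg/m^3]
    velocity: freestream velocity [m/s]
    nose_radius: effective nose radius [m]
    """
    return K_EARTH * math.sqrt(rho / nose_radius) * velocity**3

# Invented high-altitude hypersonic condition, for illustration only:
q = sutton_graves_heat_flux(rho=1e-3, velocity=6000.0, nose_radius=0.5)
q_w_cm2 = q / 1e4  # convert W/m^2 to the W/cm^2 convention common in TPS work
```

A plausible-looking but wrong answer here — say, from silently swapping in the wrong correlation or dropping the square root — is exactly the "numerically reasonable but physically invalid" failure the abstract warns about.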