TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
arXiv cs.AI, April 21, 2026
Key Points
- The paper argues that using LLMs as reasoning assistants in safety-critical aerospace work requires evaluation beyond generic math/physics benchmarks, because physically invalid yet numerically plausible answers can be more dangerous than refusals.
- It introduces TPS-CalcBench, a diagnostic benchmark focused on closed-form analytical calculations for hypersonic thermal protection system (TPS) engineering, based on how experienced engineers solve problems without simulations.
- The framework uses dual-track scoring to assess both result accuracy and reasoning quality with an 8-dimension rubric, including a calibrated judge with human audits to detect “right answer, wrong reasoning” and similar failure modes.
- TPS-CalcBench includes a large, curated dataset (420 high-confidence items and 810 noise-controlled items) plus noise-sensitivity analysis to understand how data quality affects model rankings.
- Experiments across 13 models show wide variation on the benchmark's key performance indicators and surface recurring defects (e.g., hidden formula-selection errors); three interventions (DFA-TPS fine-tuning, RAG-EQ grounding, and process-aware prompting) each improve performance within the diagnose-evaluate-intervene pipeline.
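The dual-track idea above can be sketched in a few lines: one track checks numeric result accuracy against a tolerance, the other aggregates a per-dimension reasoning rubric, and disagreement between the two flags "right answer, wrong reasoning" cases. This is purely illustrative; the names, the 8-dimension rubric layout, and all thresholds here are assumptions, not the paper's actual scoring code.

```python
from dataclasses import dataclass

RUBRIC_DIMENSIONS = 8    # the paper uses an 8-dimension reasoning rubric
REL_TOL = 0.02           # hypothetical relative tolerance for the result track
REASONING_PASS = 0.7     # hypothetical rubric pass threshold

@dataclass
class DualTrackScore:
    result_correct: bool
    reasoning_score: float   # mean rubric score in [0, 1]
    suspicious: bool         # numerically correct result but weak reasoning

def score_item(predicted: float, reference: float,
               rubric: list[float]) -> DualTrackScore:
    """Score one benchmark item on both tracks."""
    assert len(rubric) == RUBRIC_DIMENSIONS
    result_ok = abs(predicted - reference) <= REL_TOL * abs(reference)
    reasoning = sum(rubric) / len(rubric)
    return DualTrackScore(
        result_correct=result_ok,
        reasoning_score=reasoning,
        suspicious=result_ok and reasoning < REASONING_PASS,
    )

# A numerically right answer backed by a weak reasoning trace gets flagged:
s = score_item(1010.0, 1000.0, [0.9, 0.2, 0.3, 0.4, 0.5, 0.3, 0.2, 0.4])
print(s.result_correct, s.suspicious)  # True True
```

Separating the two tracks is what makes the "physically invalid yet numerically plausible" failure mode visible: a single aggregate score would hide it.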